Post by Roger Cabo on Jan 12, 2024 15:37:06 GMT 1
I've done a short test comparing the built-in Mat Mul against the hand-coded I_MatrixMultiply(), with 1 million calls each:
by calculation : 0.0532157999995633
by Mat : 0.102792799999555
Not sure what happens on Intel.
Further it's possible to speed up the I_MatrixMultiply() by about 4 times with simple threading.
Calculate each of the 4 blocks in a different thread at once!
$Library "gfawinx"
$Library "UpdateRT"
UpdateRuntime ' Patches GfaWin23.Ocx
// 4x4 matrix of Doubles in row-major order:
//   m0..m3   = row 0
//   m4..m7   = row 1
//   m8..m11  = row 2
//   m12..m15 = row 3
// (layout matches the flat indexing used by I_MatrixMultiply)
Type m_Matrix
m0 As Double
m1 As Double
m2 As Double
m3 As Double
m4 As Double
m5 As Double
m6 As Double
m7 As Double
m8 As Double
m9 As Double
m10 As Double
m11 As Double
m12 As Double
m13 As Double
m14 As Double
m15 As Double
EndType
$StepOff // let this be real! :)
Dim m1 As m_Matrix
Dim m2 As m_Matrix
Dim m3 As m_Matrix
Dim i%
OpenW 1
FontSize = 14
// Benchmark 1: hand-unrolled struct multiply, exactly 1,000,000 calls.
// (The original loop "0 To 1000000" ran 1,000,001 iterations — off by one
// versus the stated "1mio calls".)
// NOTE(review): m1/m2 are all-zero here while the Mat test uses 2s;
// multiply timing is normally data-independent, but worth confirming.
Dim t# = Timer
For i% = 1 To 1000000
I_MatrixMultiply(m1, m2, m3)
Next i%
Debug "by calculation : " & Timer - t#
// -----------------
// Benchmark 2: built-in Mat Mul on equivalent 4x4 Double arrays.
Global Double a(0 .. 3, 0 .. 3)
Global Double b(0 .. 3, 0 .. 3)
Global Double c(0 .. 3, 0 .. 3)
Mat Set a() = 2
Mat Set b() = 2
t# = Timer // i% is already declared above; the duplicate "Dim i%" was removed
For i% = 1 To 1000000
Mat Mul c() = a()*b()
Next
Debug "by Mat : " & Timer - t#
Erase a(), b(), c()
Stop
Proc I_MatrixMultiply(ByRef m1 As m_Matrix, ByRef m2 As m_Matrix, ByRef result As m_Matrix) Naked
// 4x4 matrix product: result = m1 * m2, fully unrolled (row-major layout,
// see Type m_Matrix). result must not alias m1 or m2 — each result field
// reads from both inputs after earlier result fields were already written.
// Looks like a lot of code but is very fast, because the addresses of the
// three ByRef parameters stay in address registers across all 16 statements.
// NOTE(review): "Naked" presumably suppresses the procedure prologue/epilogue
// for speed — confirm against the GFA-BASIC 32 docs.
// Multiply each row of m1 with each column of m2:
result.m0 = m1.m0 * m2.m0 + m1.m1 * m2.m4 + m1.m2 * m2.m8 + m1.m3 * m2.m12
result.m1 = m1.m0 * m2.m1 + m1.m1 * m2.m5 + m1.m2 * m2.m9 + m1.m3 * m2.m13
result.m2 = m1.m0 * m2.m2 + m1.m1 * m2.m6 + m1.m2 * m2.m10 + m1.m3 * m2.m14
result.m3 = m1.m0 * m2.m3 + m1.m1 * m2.m7 + m1.m2 * m2.m11 + m1.m3 * m2.m15
result.m4 = m1.m4 * m2.m0 + m1.m5 * m2.m4 + m1.m6 * m2.m8 + m1.m7 * m2.m12
result.m5 = m1.m4 * m2.m1 + m1.m5 * m2.m5 + m1.m6 * m2.m9 + m1.m7 * m2.m13
result.m6 = m1.m4 * m2.m2 + m1.m5 * m2.m6 + m1.m6 * m2.m10 + m1.m7 * m2.m14
result.m7 = m1.m4 * m2.m3 + m1.m5 * m2.m7 + m1.m6 * m2.m11 + m1.m7 * m2.m15
result.m8 = m1.m8 * m2.m0 + m1.m9 * m2.m4 + m1.m10 * m2.m8 + m1.m11 * m2.m12
result.m9 = m1.m8 * m2.m1 + m1.m9 * m2.m5 + m1.m10 * m2.m9 + m1.m11 * m2.m13
result.m10 = m1.m8 * m2.m2 + m1.m9 * m2.m6 + m1.m10 * m2.m10 + m1.m11 * m2.m14
result.m11 = m1.m8 * m2.m3 + m1.m9 * m2.m7 + m1.m10 * m2.m11 + m1.m11 * m2.m15
result.m12 = m1.m12 * m2.m0 + m1.m13 * m2.m4 + m1.m14 * m2.m8 + m1.m15 * m2.m12
result.m13 = m1.m12 * m2.m1 + m1.m13 * m2.m5 + m1.m14 * m2.m9 + m1.m15 * m2.m13
result.m14 = m1.m12 * m2.m2 + m1.m13 * m2.m6 + m1.m14 * m2.m10 + m1.m15 * m2.m14
result.m15 = m1.m12 * m2.m3 + m1.m13 * m2.m7 + m1.m14 * m2.m11 + m1.m15 * m2.m15
// The °-prefixed lines below are disabled (commented-out) code: the
// loop-based equivalent of the unrolled statements above, kept for reference.
°// ... Durchführen der Matrixmultiplikation
°For i% = 0 To 3
°For j% = 0 To 3
°result(i * 4 + j) = m_rot(i * 4) * m_trans(j) + m_rot(i * 4 + 1) * m_trans(j + 4) + m_rot(i * 4 + 2) * m_trans(j + 8) + m_rot(i * 4 + 3) * m_trans(j + 12)
°Next
°Next
EndProc
by calculation : 0.0532157999995633
by Mat : 0.102792799999555
Not sure what happens on Intel.
Further it's possible to speed up the I_MatrixMultiply() by about 4 times with simple threading.
Calculate each of the 4 blocks in a different thread at once!