Help for vec3/4 Library Speedtest C/Linux
category: code [glöplog]
tl;dr
Test this.
https://github.com/Kabelmaulwurf/vec3
Hi guys,
after getting bored of implementing my vec3/4 stuff again and again I wanted to make a "library" out of it.
So I made one and added SSE support with inline asm.
But I noticed major performance differences between my machines and wondered why.
So I wanted to run a broad field test on as many machines as possible.
And this is where you come into play.
I need you to test my library :)
Testing is as easy as
1. checkout git
2. make
3. run ./test.sh
4. pastebin results
Analysis of the results will follow like this
Would be nice to have some results and suggestions.
Also every tester receives a beer the next time we meet :)
https://github.com/Kabelmaulwurf/vec3
P.S. Architecture/testing info will follow soon, but you've got the ugly source anyway :P
Tried compiling under OSX; however, it doesn't support -masm, so I just substituted that. It doesn't compile anyway.
Code:
gcc -c -Wall -Wextra -O3 -funroll-loops -nasm=intel -c -o SpeedTest.o SpeedTest.c
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `addps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `subps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `mulps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `divps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
make: *** [SpeedTest.o] Error 1
Also, I guess I'm too stupid to figure out how to fix your makefile for my Gentoo.
I know I'm supposed to fix -lm into Speedtest.o somewhere, but it doesn't seem to catch on..
Code:
gcc -o SpeedTest SpeedTest.o
SpeedTest.o: In function `v4_Length':
SpeedTest.c:(.text+0x228): undefined reference to `sqrtf'
SpeedTest.o: In function `v4_Normalize':
SpeedTest.c:(.text+0x28f): undefined reference to `sqrtf'
collect2: ld returned 1 exit status
make: *** [SpeedTest] Error 1
Figured it out. Here are the results
Oh, sorry about the missing -lm, going to fix that now.
And many thx for the results!
I also fixed it under OSX - you need to use %% for the xmm registers. It's a little bit different for the test script though - no cpuinfo on Darwin. I'm hacking something together, so I should have the results soon.
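For anyone hitting the same "too many memory references" errors: they usually show up when Intel-syntax asm reaches an assembler that expects AT&T syntax. Here is a minimal sketch of the %% form, assuming GCC or clang extended asm in the default AT&T mode (so built without -masm=intel). The function name v4_add_asm is made up for illustration and is not taken from the repo; there is a plain C fallback for builds without SSE.

```c
#include <assert.h>

/* Hypothetical helper, not taken from the vec3 repo: out = a + b on four floats. */
static void v4_add_asm(const float *a, const float *b, float *out)
{
    assert(a && b && out);
#if defined(__GNUC__) && defined(__SSE__)
    /* AT&T syntax: source first, destination last. In extended asm a single
       '%' introduces an operand (%0, %1, ...), so a literal register name
       has to be written with '%%'. */
    __asm__ volatile(
        "movups (%1), %%xmm0\n\t"   /* xmm0 = a[0..3] (unaligned load) */
        "movups (%2), %%xmm1\n\t"   /* xmm1 = b[0..3] */
        "addps  %%xmm1, %%xmm0\n\t" /* xmm0 += xmm1 */
        "movups %%xmm0, (%0)\n\t"   /* out[0..3] = xmm0 */
        :
        : "r"(out), "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory");
#else
    /* Scalar fallback when SSE or GNU-style asm is not available. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

The clobber list tells the compiler that xmm0, xmm1 and memory are touched, so it does not keep live values there across the asm statement.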
Just did the graphs, it seems that it's faster under linux in a VM than native on OSX. Apple quality..
thx to bartman.
thx again to Movi
@movi: an extra thx for testing it in a VM :)
so plotting works at least ? :)
Yeah. Couldn't neatly compile matplotlib under OSX (without installing yet another, 3rd copy of Python), so I did it under the VM. Oh, and I read the graphs in the wrong order. It is slower under the VM, so everything checks out :)
Making good use of SSE is a bit more complicated than that. Check this out for example:
http://bullet.svn.sourceforge.net/viewvc/bullet/trunk/Extras/vectormathlibrary/include/vectormath/
You also have xnamath.h in the directxsdk.
What you want to look for is how they deal with return types, byte alignment (16 byte boundaries) with padding.
IMHO the usefulness of SSE when it comes to 3d vectormath is somewhat limited. If you really need to crunch you're better off using Eigen or GPGPU stuff. Using SSE compiler switches in general seems to be a good idea though.
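To make the alignment and padding point concrete, here is a small sketch assuming a C11 compiler. The type name vec3_padded and the helper is_aligned16 are hypothetical and are not how the Bullet vectormath headers or xnamath.h actually declare their types. The idea is just that a 3-float vector gets a fourth padding float and a 16-byte alignment requirement, so it fills exactly one SSE register and aligned loads become possible.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof (C11) */
#include <stdint.h>    /* uintptr_t */

/* Hypothetical layout: three floats plus one float of explicit padding.
   alignas(16) on the first member raises the alignment of the whole struct. */
typedef struct {
    alignas(16) float x;
    float y, z;
    float pad;  /* unused, keeps sizeof at exactly 16 bytes */
} vec3_padded;

/* Compile-time checks: one vector occupies exactly one 16-byte slot. */
static_assert(sizeof(vec3_padded) == 16, "vec3_padded must be 16 bytes");
static_assert(alignof(vec3_padded) == 16, "vec3_padded must be 16-byte aligned");

/* Returns 1 if p sits on a 16-byte boundary, which aligned SSE loads
   and stores (movaps) require. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

Because sizeof is a multiple of the alignment, every element of a vec3_padded array stays on a 16-byte boundary as well.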
@Yomat: Thx for the link.
I searched for some good source code but found none.
Only the idLib math code from Doom 3, which is quite nice, but it is C++ with __asm, which has better compiler integration than gcc does.
Yeah, alignment was a pity when I experimented with vec3 and my x,y,z struct.
I don't really need to crunch, this should just be the first step in getting my stuff together ;)
So why not use intrinsics instead of raw assembler code? Such as those defined in e.g. xmmintrin.h and emmintrin.h?
My prod uses SSE intrinsics and includes source code.
I hope it will help you.
(But it will work only in vc++)
http://www.pouet.net/prod.php?which=56553
I also recommend that you use intrinsics instead of asm.
If your inline functions have code like this:
addps xmm0, xmm1
mulps xmm0, xmm1
they will always use the xmm0 and xmm1 registers.
So your compiler might have to insert instructions to copy/back up registers.
When your code does "x = a+b+c*d;",
the compiler will generate code like this:
'xmm0 = a
'xmm1 = b
addps xmm0, xmm1
'backup xmm0
'xmm0 = c
'xmm1 = d
mulps xmm0, xmm1
'restore a+b to xmm1
addps xmm0, xmm1
SSE has xmm0~xmm7 (xmm0~xmm15 in 64-bit mode).
If you use intrinsics, the compiler will assign better registers or memory to your code.
And most modern compilers (vc++, gcc, icc) support them.
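As a rough sketch of the same "x = a+b+c*d" example written with intrinsics from xmmintrin.h: the function name v4_add_add_mul is invented for illustration, and the scalar branch is only there so the snippet still builds where __SSE__ is not defined. No register is named anywhere, so the compiler is free to pick registers and avoid the backup/restore steps.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_mul_ps, _mm_storeu_ps */
#endif

/* Hypothetical helper: out = a + b + c * d, element-wise on four floats. */
static void v4_add_add_mul(const float *a, const float *b,
                           const float *c, const float *d, float *out)
{
    assert(a && b && c && d && out);
#if defined(__SSE__)
    /* Unaligned loads, so the caller does not need 16-byte aligned data. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 vd = _mm_loadu_ps(d);

    __m128 sum  = _mm_add_ps(va, vb);  /* a + b */
    __m128 prod = _mm_mul_ps(vc, vd);  /* c * d */

    _mm_storeu_ps(out, _mm_add_ps(sum, prod));
#else
    /* Scalar fallback for targets without SSE. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i] + c[i] * d[i];
#endif
}
```

If the data is known to be 16-byte aligned, _mm_load_ps and _mm_store_ps can replace the unaligned variants.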
Plus ICC and GCC generate better SSE code than the VC compiler.
tomohiro: Thanks for the source.
I used GCC intrinsics before and had the problem that GCC created a huge number of instructions for backing up registers and moving stuff around.
So my idea was to just write the inline asm myself and cut that down.
Are you sure v4_Compare(..) works ?
Isn't comparing floats for equality evil ?
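For reference, a tolerance-based comparison is the usual alternative to ==. This is only a sketch with made-up names (nearly_equal, v4_nearly_equal), not the v4_Compare from the repo, and it uses an absolute epsilon, which is only reasonable when the values are roughly of magnitude 1.

```c
#include <assert.h>

/* 1 if a and b differ by at most eps (absolute tolerance). */
static int nearly_equal(float a, float b, float eps)
{
    float diff = a - b;
    if (diff < 0.0f)
        diff = -diff;
    return diff <= eps;  /* false if either input is NaN */
}

/* Hypothetical helper: 1 if all four components match within eps. */
static int v4_nearly_equal(const float *a, const float *b, float eps)
{
    assert(a && b);
    for (int i = 0; i < 4; ++i) {
        if (!nearly_equal(a[i], b[i], eps))
            return 0;
    }
    return 1;
}
```

For values spread over very different magnitudes, a tolerance scaled by the larger operand is usually the better choice.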
flure: whoops, I see that the == version is still in there.
I just needed a function to return whether one is greater than the other.
It was only for some sorting stuff...
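For the sorting use case, a three-way comparator avoids == entirely. The sketch below assumes the vectors should be ordered by length, which is a guess since the thread does not say what v4_Compare orders by. The vec4 layout and the names v4_length_sq and v4_compare_length are hypothetical. Comparing squared lengths gives the same order as comparing lengths and skips the sqrtf call.

```c
#include <assert.h>
#include <stdlib.h>  /* qsort */

/* Hypothetical layout, assumed for illustration only. */
typedef struct { float x, y, z, w; } vec4;

static float v4_length_sq(const vec4 *v)
{
    return v->x * v->x + v->y * v->y + v->z * v->z + v->w * v->w;
}

/* qsort-style comparator: negative, zero or positive depending on
   whether a is shorter than, as long as, or longer than b. */
static int v4_compare_length(const void *pa, const void *pb)
{
    assert(pa && pb);
    float la = v4_length_sq((const vec4 *)pa);
    float lb = v4_length_sq((const vec4 *)pb);
    return (la > lb) - (la < lb);
}
```

It can be passed straight to qsort, e.g. qsort(v, n, sizeof v[0], v4_compare_length);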
Quote:
I used GCC intrinsics before and had the problem that GCC created a huge number of instructions for backing up registers and moving stuff around.
So my idea was to just write the inline asm myself and cut that down.
By hand you can't take into account instruction pairing, instruction and cache latency and whatnot. That's what compilers are there for. Sadly only the Intel compiler seems to create good SSE code...
@raer: that's why my next idea is to write stuff directly in asm and call it from C,
just passing the addresses of the operands.
Wasn't there a release of an alternative compiler made by Intel, or is my brain making something up?
I need to look through my bookmarks when I'm back home.
Sorry for the repost, giving LLVM a try now,
if I get the time to fire up the VM.
Btw, thanks for all the information.
Normally compilers should generate proper asm code from intrinsics and that should be the way to go, but... what I wrote. Hand-coded asm is almost always slower than code created by a GOOD compiler.
Anyway. Try LLVM. Might be worth a shot.
Results:
http://pastebin.com/cknhBcQ4
cpuinfo:
http://pastebin.com/2yB1d6Un