Help for vec3/4 Library Speedtest C/Linux

category: code [glöplog]

tl;dr
Test this.
https://github.com/Kabelmaulwurf/vec3

Hi guys,
after getting bored of implementing my vec3/4 stuff again and again, I just wanted to make a "library" out of it.
So I made one and added SSE support with inline asm.
But I noticed major performance differences between my machines and wondered why.
So I wanted to have a broad field test on as many machines as possible.

And this is where you come into play.
I need you to test my library :)

Testing is as easy as
1. checkout git
2. make
3. run ./test.sh
4. pastebin results

Analysis of the results will follow, like this:
[image]

Would be nice to have some results and suggestions.
Also, every tester receives a beer the next time we meet :)

https://github.com/Kabelmaulwurf/vec3

P.S. Architecture/Testing info following soon but you got the ugly source anyway :P

added on the 2011-12-12 23:11:08 by Kabelmaulwurf

Tried compiling under OSX. However, it doesn't support -masm, so I just substituted that. Doesn't compile anyway.

Code:

gcc -c -Wall -Wextra -O3 -funroll-loops -nasm=intel -c -o SpeedTest.o SpeedTest.c
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `addps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:96:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `subps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:125:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `mulps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:154:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `divps'
/var/folders/rw/bmshy6fx5_ndn_0qk_xkxkj00000gn/T//cc68hP1V.s:183:too many memory references for `movups'
make: *** [SpeedTest.o] Error 1

added on the 2011-12-12 23:26:58 by Movi

Also, I guess I'm too stupid to figure out how to fix your makefile for my Gentoo.

Code:


gcc -o SpeedTest SpeedTest.o
SpeedTest.o: In function `v4_Length':
SpeedTest.c:(.text+0x228): undefined reference to `sqrtf'
SpeedTest.o: In function `v4_Normalize':
SpeedTest.c:(.text+0x28f): undefined reference to `sqrtf'
collect2: ld returned 1 exit status
make: *** [SpeedTest] Error 1

I know I'm supposed to add -lm to the SpeedTest link line somewhere, but it doesn't seem to take effect...

added on the 2011-12-12 23:45:53 by Movi

Figured it out. Here are the results

added on the 2011-12-12 23:59:17 by Movi

Oh, sorry about the missing -lm, going to fix that now.
And many thx for the results!

added on the 2011-12-13 00:29:52 by Kabelmaulwurf

http://pastebin.com/Bk9rtwHU

added on the 2011-12-13 00:34:32 by bartman

I also fixed it under OSX - you need to use %% for the xmm registers. It's a little bit different for the test script though - no cpuinfo on Darwin. I'm hacking something together, so I should have the results soon.
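For anyone hitting the same assembler errors, here is a minimal sketch of what the %% fix can look like in GCC extended inline asm with the default AT&T syntax. The helper name v4_add and its signature are made up for illustration and are not taken from the library; the scalar branch is only there so the sketch also builds on targets without SSE.

```c
/* Hypothetical helper, not the vec3 library's code. */
static void v4_add(const float *a, const float *b, float *out)
{
#if defined(__GNUC__) && defined(__SSE__)
    /* AT&T syntax: source operand first, destination last.
       In extended asm a single % introduces an operand (%0, %1, ...),
       so a literal register name needs a doubled percent sign. */
    __asm__ volatile(
        "movups (%0), %%xmm0\n\t"   /* xmm0 = a[0..3] */
        "movups (%1), %%xmm1\n\t"   /* xmm1 = b[0..3] */
        "addps  %%xmm1, %%xmm0\n\t" /* xmm0 += xmm1   */
        "movups %%xmm0, (%2)\n\t"   /* out[0..3] = xmm0 */
        :
        : "r"(a), "r"(b), "r"(out)
        : "xmm0", "xmm1", "memory");
#else
    /* Portable fallback for non-SSE targets. */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

The clobber list tells the compiler that xmm0, xmm1 and memory are touched, so it does not keep live values there across the asm statement.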

added on the 2011-12-13 00:35:07 by Movi

Results from Darwin

added on the 2011-12-13 00:41:31 by Movi

Just did the graphs, it seems that it's faster under linux in a VM than native on OSX. Apple quality..

added on the 2011-12-13 00:44:01 by Movi

thx to bartman.
thx again to Movi

@movi: an extra thx for testing it in a VM :)
so plotting works at least ? :)

added on the 2011-12-13 00:54:54 by Kabelmaulwurf

Yeah. Couldn't neatly compile matplotlib under OSX (without installing yet a 3rd copy of Python), so I did it under the VM. Oh, and I read the graphs in the wrong order. It is slower under the VM, so everything checks out :)

added on the 2011-12-13 00:58:34 by Movi

Making good use of SSE is a bit more complicated than that. Check this out for example:
http://bullet.svn.sourceforge.net/viewvc/bullet/trunk/Extras/vectormathlibrary/include/vectormath/

You also have xnamath.h in the DirectX SDK.

What you want to look for is how they deal with return types and byte alignment (16-byte boundaries) with padding.
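A rough sketch of the alignment-with-padding idea, assuming C11. The type vec3_padded and the helper is_aligned16 are hypothetical names for illustration; the linked libraries do this in their own way.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical layout: a vec3 padded to four floats so it fills
   exactly one 16-byte SSE register and starts on a 16-byte boundary. */
typedef struct {
    alignas(16) float x;
    float y, z;
    float pad; /* unused fourth lane */
} vec3_padded;

/* Returns 1 if p sits on a 16-byte boundary, which aligned SSE
   loads and stores such as movaps require. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With this layout an array of vec3_padded keeps every element aligned, at the cost of one wasted float per vector.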

IMHO the usefulness of SSE when it comes to 3d vectormath is somewhat limited. If you really need to crunch you're better off using Eigen or GPGPU stuff. Using SSE compiler switches in general seems to be a good idea though.

added on the 2011-12-13 01:03:02 by Yomat

@Yomat: Thx for the link.
I searched for some good source code but found none.

Only the id-lib math code from Doom 3, which is quite nice, but it uses C++ __asm, which has better compiler integration than GCC's inline asm.

Yeah, alignment was a pain when I experimented with vec3 and my x,y,z struct.

I don't really need to crunch, this should just be the first step in getting my stuff together ;)

added on the 2011-12-13 01:13:29 by Kabelmaulwurf

So why not use intrinsics instead of raw assembler code? Such as those defined in e.g. xmmintrin.h and emmintrin.h?
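To give an idea of what that looks like, here is a small sketch of a 4-component dot product using only SSE1 intrinsics from xmmintrin.h. The function name v4_dot is an assumption for illustration, not something from the library, and the scalar branch is just a fallback for targets without SSE.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dot product of two 4-float vectors. */
static float v4_dot(const float *a, const float *b)
{
#if defined(__SSE__)
    /* p = (a0*b0, a1*b1, a2*b2, a3*b3); unaligned loads, so no
       alignment requirement on a and b. */
    __m128 p  = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 hi = _mm_movehl_ps(p, p);   /* (p2, p3, p2, p3) */
    __m128 s  = _mm_add_ps(p, hi);     /* lane0 = p0+p2, lane1 = p1+p3 */
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(s, s1)); /* lane0 + lane1 */
#else
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

No register names appear anywhere; the compiler is free to pick registers and to schedule the loads.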

added on the 2011-12-13 06:48:54 by nystep

My prod uses SSE intrinsics and includes source code.
I hope it helps you.
(But it will only work in VC++.)
http://www.pouet.net/prod.php?which=56553

I also recommend using intrinsics instead of asm.
If your inline function has code like this:
addps xmm0, xmm1
mulps xmm0, xmm1
it will always use the xmm0 and xmm1 registers.
So your compiler might have to insert instructions to copy or back up registers.

When your code does "x = a+b+c*d;",
the compiler will generate code like this:
'xmm0 = a
'xmm1 = b
addps xmm0, xmm1
'backup xmm0
'xmm0 = c
'xmm1 = d
mulps xmm0, xmm1
'restore a+b to xmm1
addps xmm0, xmm1

SSE has xmm0~xmm7 (xmm0~xmm15 in 64-bit mode).
If you use intrinsics, the compiler will assign better registers or memory to your code.
And most modern compilers (VC++, GCC, ICC) support them.
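As a sketch of the same "x = a+b+c*d" expression written with intrinsics instead of fixed registers (the name v4_expr and the pointer-based signature are assumptions for illustration only):

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: x = a + b + c * d, component-wise on 4 floats. */
static void v4_expr(const float *a, const float *b,
                    const float *c, const float *d, float *x)
{
#if defined(__SSE__)
    /* Each __m128 is just a value; the compiler decides which xmm
       register (or stack slot) holds it, so no manual backup and
       restore of xmm0/xmm1 is needed. */
    __m128 sum  = _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(c), _mm_loadu_ps(d));
    _mm_storeu_ps(x, _mm_add_ps(sum, prod));
#else
    /* Portable fallback for non-SSE targets. */
    for (int i = 0; i < 4; ++i)
        x[i] = a[i] + b[i] + c[i] * d[i];
#endif
}
```

Because a+b and c*d are independent values here, the compiler can keep both in different registers rather than shuffling everything through xmm0 and xmm1.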

added on the 2011-12-13 09:25:01 by tomohiro

Plus ICC and GCC generate better SSE code than the VC compiler.

added on the 2011-12-13 10:49:04 by raer

tomohiro: Thanks for the source.
I used GCC intrinsics before and had the problem that GCC created a huuuge amount of instructions for backing up registers and moving stuff around.
So my guess was just to inline it myself and shorten this.

added on the 2011-12-13 11:47:54 by Kabelmaulwurf

Are you sure v4_Compare(..) works ?
Isn't comparing floats for equality evil ?
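A common alternative to == is a tolerance-based comparison. The following is only a sketch: approx_equal, its rel_eps parameter and the choice of a relative tolerance with a floor of 1.0 are assumptions for illustration, not what v4_Compare does.

```c
/* Local absolute value so the sketch does not depend on libm. */
static float abs_f(float v)
{
    return v < 0.0f ? -v : v;
}

/* Returns 1 if a and b differ by at most rel_eps, scaled by the
   larger magnitude of the two (but never scaled below 1.0, so
   values near zero are compared with an absolute tolerance). */
static int approx_equal(float a, float b, float rel_eps)
{
    float diff  = abs_f(a - b);
    float fa    = abs_f(a);
    float fb    = abs_f(b);
    float scale = fa > fb ? fa : fb;
    if (scale < 1.0f)
        scale = 1.0f;
    return diff <= rel_eps * scale;
}
```

For a vec4 this would be applied per component; a suitable rel_eps depends on how much rounding error the preceding math can accumulate.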

added on the 2011-12-13 12:14:02 by flure

flure: whoops, I saw that the == version is still in there.

Just needed a function to return whether one is greater than the other.
Was only for some sorting stuff...
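For sorting, a greater/less comparison can avoid float equality altogether. This sketch assumes the vectors are ordered by length, which the thread does not state; the vec4 layout and the names v4_length_sq and v4_compare_length are hypothetical. Comparing squared lengths gives the same order as comparing lengths and needs no sqrtf.

```c
#include <stdlib.h>

typedef struct { float x, y, z, w; } vec4; /* hypothetical layout */

/* Squared length: monotonic in the real length, so it sorts the same. */
static float v4_length_sq(const vec4 *v)
{
    return v->x * v->x + v->y * v->y + v->z * v->z + v->w * v->w;
}

/* qsort-style comparator: negative if a is shorter than b,
   positive if longer, zero otherwise. */
static int v4_compare_length(const void *pa, const void *pb)
{
    float la = v4_length_sq((const vec4 *)pa);
    float lb = v4_length_sq((const vec4 *)pb);
    return (la > lb) - (la < lb);
}
```

It plugs straight into the standard library, e.g. qsort(vecs, count, sizeof(vec4), v4_compare_length).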

added on the 2011-12-13 12:21:04 by Kabelmaulwurf

Quote:

I used GCC intrinsics before and had the problem that GCC created a huuuge amount of instructions for backing up registers and moving stuff around.
So my guess was just to inline it myself and shorten this.

By hand you cannot take into account instruction pairing, instruction and cache latency, and whatnot. That's what compilers are there for. Sadly, only the Intel compiler seems to create good SSE code...

added on the 2011-12-13 14:20:04 by raer

@raer: that's why my next idea is to write stuff directly in asm and call it from C.
Just passing the addresses of the operands.

Wasn't there a release of an alternative compiler made by Intel, or is my brain making something up?

Need to look through my bookmarks when I'm back home.

added on the 2011-12-13 14:50:18 by Kabelmaulwurf

Sorry for the repost, giving LLVM a try now.
That is, if I get time to fire up the VM.
Btw, thanks for all the information.

added on the 2011-12-13 14:56:01 by Kabelmaulwurf

Normally compilers should generate proper asm code from intrinsics and that should be the way to go, but... what I wrote. Hand-coded asm is almost always slower than code created by a GOOD compiler.
Anyway. Try LLVM. Might be worth a shot.

added on the 2011-12-13 15:05:55 by raer

Results:
http://pastebin.com/cknhBcQ4
cpuinfo:
http://pastebin.com/2yB1d6Un

added on the 2011-12-13 23:13:05 by joooo

No wait, this is what you want to see

added on the 2011-12-13 23:19:47 by joooo
