Short intro to removing code branches

category: code [glöplog]

Hey guys,

I wrote a post where I've tried to explain how to write branchless code and thought it would be great to hear some feedback from you on it. Here's the link

Let me know what you think!

added on the 2015-11-22 12:08:11 by jabudcha

Have you actually benchmarked these two? Any normal C compiler these days is likely to transform something as simple as the first one to a conditional move instruction anyway, which is faster than trying to implement it yourself.

Not to mention that branches are not a problem; unpredictable branches are.

added on the 2015-11-22 12:20:17 by Sesse

I just ran a quick benchmark on this, and you're right, when the branches are not predictable I get some mixed results on which one executes faster, but when the branching is more predictable then this the branchless code seems to run slower.

added on the 2015-11-22 14:47:59 by jabudcha

I compiled both of your variants with GCC (5.2.1), locking “some_other_value” to an arbitrary value (89). The first yields:

Code:

imul   edx,edi,0x17
lea    eax,[rdi-0x5]
cmp    edi,0x59
cmovge eax,edx

and the branchless:

Code:

imul   eax,edi,0x17
lea    ecx,[rdi-0x5]
mov    edx,0x59
sub    edx,edi
sar    edx,0x1f
xor    eax,ecx
and    eax,edx
xor    eax,ecx

I'll let you draw your own conclusions. :-)

added on the 2015-11-22 14:56:29 by Sesse

So far, on my graphics card at least, such tricks are still useful for shader programming. But for the CPU, yeah, hard to beat the compiler (except using parallelization and cache optimization).
Sesse, in your branchless disassembly, the xor eax, ecx could be executed before (interleaved between 2 operations uppon edx), so that it would be free. Strange that the compiler did not figure it out. Or did I miss something?

added on the 2015-11-23 08:47:24 by Soundy

I think most CPUs already do out-of-order execution on their own, so that shouln't matter much?

added on the 2015-11-23 09:48:44 by PulkoMandy

Essential optimization.

Great for parallel procedsing, like opencl or even just glsl.

Because there a compiler that often makes outdated suboptimal assumptions on how to optimite for modern Hardware has to decide how to compile segments to pipelines.

The More constants and The less branches, The faster and better will ANY compiler perform.

And any branching Path can easily pause flow controll, up to doubling The number of gpu cycles.

Even in glsl a dotproduct can often be done faster than an if() statement due to parallel pipes being best without valves branching between them.

added on the 2016-04-03 20:46:07 by ollj

pouët.net

Short intro to removing code branches

login