Raymarching Beginners' Thread
category: code [glöplog]
I didn't know Mike could hover :D
FreeFull: Mike's DF is not particularly clean. IQ's original shader has artifacts as well for the same reason. However, he reduced shadow blurring radius, so it's not as visible.
Omid, please put an email in your profile. i'll contact you
lollol : omidpforce@gmail.com
btw how can i add email to my profile ?
btw how can i add email to my profile ?
lollop : what happened ? Gave up on emailing me ? I'm still waiting.
If you had to raymarch such a scene, how would you proceed ? (especially for the part between the floor/ceiling and the twisted tubes).
Modeling wise, using Mercury's hg_sdf:
pRotate(p.xz, p.y);
pModPolar(p.xz, 6.0);
p.x += 1.0; // this is probably wrong, off the top of my head, you need to offset x or z positive or negative
float d = fCircle(p.xz, 0.3);
d = fOpUnionSoft(d, abs(p.y - 10.0), 0.5); // maybe that abs() needs a - in front of it, again off the top of my head
Possibly vary radius per cell and offset p.y by some smooth random to make it wobbly.
Texture wise you could try a tri-planar method, or figure out a 3D noise that looks sort of like this, at the top you could get away with a smoothstepped perlin() but at the bottom it has a more maze-like pattern that I don't exactly know a good 3D function for.
This Shadertoy by Shane could be an interesting start:
https://www.shadertoy.com/view/llGGzh
Just my 2 cents!
Sounds like a fun scene to do but gotta eat before I can properly sit down on something like this :)
pRotate(p.xz, p.y);
pModPolar(p.xz, 6.0);
p.x += 1.0; // this is probably wrong, off the top of my head, you need to offset x or z positive or negative
float d = fCircle(p.xz, 0.3);
d = fOpUnionSoft(d, abs(p.y - 10.0), 0.5); // maybe that abs() needs a - in front of it, again off the top of my head
Possibly vary radius per cell and offset p.y by some smooth random to make it wobbly.
Texture wise you could try a tri-planar method, or figure out a 3D noise that looks sort of like this, at the top you could get away with a smoothstepped perlin() but at the bottom it has a more maze-like pattern that I don't exactly know a good 3D function for.
This Shadertoy by Shane could be an interesting start:
https://www.shadertoy.com/view/llGGzh
Just my 2 cents!
Sounds like a fun scene to do but gotta eat before I can properly sit down on something like this :)
Thanks for you help. I am going to take a look at what you posted.
Looks like an Old shoot them up background
imho the question should be more like "why would you want raymarch that?" ;)
Tigrou: beautiful, I already thought it's raymarched.. If you would get it done that way, that would be awesome, indeed! (I really do like the nearfield/farfield look)
one question:
is there any difference between these:
and:
?
Both use cartesian equations right? So wether you use raytracing or raymarching, the equation of an object is the same for both rendering methods?
Wanting to exemplify the similarities of the object-equations between those two methods.
is there any difference between these:
Code:
float sphereDist( vec3 p, float radius )
{
length( p ) - radius;
}
and:
Code:
(x-xo)²+(y-yo)²+(z-zo)² = R²
?
Both use cartesian equations right? So wether you use raytracing or raymarching, the equation of an object is the same for both rendering methods?
Wanting to exemplify the similarities of the object-equations between those two methods.
rudi: if you want to see similarity better you should write the implicit equation of the sphere as:
with a simple algebraic manipulation (adding R to both sides and squaring both sides) you will get:
Code:
length(p) - R = 0,
where length(p) = sqrt(p.x²+p.y²+p.z²)
with a simple algebraic manipulation (adding R to both sides and squaring both sides) you will get:
Code:
p.x²+p.y²+p.z² = R²
One more thing: there is a subtle difference though. Both equations are equivalent only if R>=0. If R is negative, set of solutions to the first equation, length(p) - R = 0, is empty, while solution to the second (squared) equation is a sphere with radius -R.
Nevertheless, negative radius doesn't make sense in your application.
Nevertheless, negative radius doesn't make sense in your application.
yes yes. thanks tomkh for the implicit equations, it shows exactly that they are similar if R>=0. and you're right about that negative radius makes no sense in our case.
Once upon a time, a fox who had no tail tried to convince all other foxes to cut out their tails because doing so was more "fashionable". And once upon a time, some people who COULD NOT WRITE COMPUTER PROGRAMS tried to convince all the other people that they should not learn how to write their own engines, but instead use the easily available engines, because PROGRAMMING ENGINES WAS USELESS. Complete the logical reasoning yourself. Good luck on this hard puzzle. Kisses.
💩
Is there a way to use bounding boxes to optimize a scene with lots of objects ?
My idea would be to first raytrace (or raymarch?) against bounding boxes. If done at GPU side, could the results (which ray hit which objects) be stored in a texture ?
Then once we go that info, to do regular raymarching but only with bounding boxes who got hit.
Ideally, I would do that per ray. So for example ray at x = 0.5 y = 0.6 would raymarch against 5 objects of the scene but ray at x = 0.7 y = 0.1 would raymarch against 2 objects. I'm unsure if it's possible. AFAIK GPU always take both side of a IF condition, and because of that time to calculate pixel value in a shader is the same for all pixels.
Any ideas?
My idea would be to first raytrace (or raymarch?) against bounding boxes. If done at GPU side, could the results (which ray hit which objects) be stored in a texture ?
Then once we go that info, to do regular raymarching but only with bounding boxes who got hit.
Ideally, I would do that per ray. So for example ray at x = 0.5 y = 0.6 would raymarch against 5 objects of the scene but ray at x = 0.7 y = 0.1 would raymarch against 2 objects. I'm unsure if it's possible. AFAIK GPU always take both side of a IF condition, and because of that time to calculate pixel value in a shader is the same for all pixels.
Any ideas?
memory fetches are pretty slow on the GPU, my guess is that not storing but instead marching inside the AABBs right after determining a leaf hit would be faster (though you need something to enforce thread coherency, for primary rays it should be okay since pixel shaders are executed pretty coherently)
plus for a real benefit your objects need to be pretty complicated, it's the idea of the sphere-march-thing that the sdf itself also acts as a sort of an accelerating structure
plus for a real benefit your objects need to be pretty complicated, it's the idea of the sphere-march-thing that the sdf itself also acts as a sort of an accelerating structure
and oh also, yes. people have done raymarching against simplified versions when far away just for this reason, so for example replacing a torus with a displacement by just a (slightly bulkier) torus
Re. the "if" part, consider it like this. You have a bit of code, something like:
Each GPU core is going to render a block of maybe 32 or 64 pixels, using SIMD. If some pixels hit the box and some don't, you lose coherency, which will kill performance. BUT unless those bounding boxes are very small, in most cases a block of pixels will be all inside or all outside the bounding box.
So in reality, only blocks of pixels on the edge of bounding boxes will be slow, which might not be a problem if there are few / big bounding boxes.
However, there's another problem: when the shader gets compiled, it doesn't know if it's going to hit one branch or the other, or both. So it has to allow for both, and it has to allocate registers for both branches. That means it's far more likely to run out of registers, which hits performance hard.
This is maybe the main reason for avoiding branches on the GPU, but it depends on what's in the branch. If it's something like "if (x) raymarch; else { just set the pixel colour and exit; }" it's not an issue.
Code:
if (hitBoundingBox) {
raymarch();
else {
doSomethingElse():
}
Each GPU core is going to render a block of maybe 32 or 64 pixels, using SIMD. If some pixels hit the box and some don't, you lose coherency, which will kill performance. BUT unless those bounding boxes are very small, in most cases a block of pixels will be all inside or all outside the bounding box.
So in reality, only blocks of pixels on the edge of bounding boxes will be slow, which might not be a problem if there are few / big bounding boxes.
However, there's another problem: when the shader gets compiled, it doesn't know if it's going to hit one branch or the other, or both. So it has to allow for both, and it has to allocate registers for both branches. That means it's far more likely to run out of registers, which hits performance hard.
This is maybe the main reason for avoiding branches on the GPU, but it depends on what's in the branch. If it's something like "if (x) raymarch; else { just set the pixel colour and exit; }" it's not an issue.
to extrapolate on "losing coherency", essentially all the "threads" in the SIMD-ish block have to execute the same commands. So if all the threads pick either raymarch(); or doSomethingElse();, they all execute that function. But if some of them pick the former and some the latter, each thread essentially executes both branches -- they just disable reads/writes for the one they didn't need to. This makes the code run potentially quite slow.
the same effect happens with loops; all of the threads of the block will keep hanging around, "executing" the loop until all of them have exited it. as an extreme, think of 31 threads doing one iteration and 1 thread doing a hundred iterations -- it takes as long as all threads doing those hundred iterations (if no global memory accesses etc. are involved, those are obviously disabled for the shadow-executing threads and might have a huge performance impact)
the same effect happens with loops; all of the threads of the block will keep hanging around, "executing" the loop until all of them have exited it. as an extreme, think of 31 threads doing one iteration and 1 thread doing a hundred iterations -- it takes as long as all threads doing those hundred iterations (if no global memory accesses etc. are involved, those are obviously disabled for the shadow-executing threads and might have a huge performance impact)
i got bored enough to read and reply, to 10 pages earlier:
fear my nonsense:
A single pass errosion can be emulated with some multi octave fbm, multiplied by a (tiled) voronoi. it is commonly used to do lava flowing around rocks or ice sheets. with simpler noise, but the right height-map noise and exponential functions with different exponents per octave may make some nice rivers out of that.
---
domain warping in raymarching is simple. twist() is the most common, easily found on shadertoy.com, to turn a cylinder into a helix. large lipschitz constants are almost guaranteed.
bending a cylinder or box around a point that is nowhere near the surface of the cylinder is also quite common.
all domain warps come down to a 3d to 3d transformation matrix that is very non-uniform.
it quickly ends up becoming a (kaleidoscopic) iterated function system fractal, with complex number transformations.
changing the ray direction during ray marching is NOT how do do it. because inverse-warping the set is equal to warping the ray.
what you do is, convert yout transformation into a mat3() and then do inverse(mat3())
----
8 pages late;
hornet, your break condition for the raymarching loop is heuristic, meaning, sphere tracking approaches a surface, but will unlikely ever hit it (or below it), it will however get very close to the surface relatively quickly.
heuristics is not about getting the EXACT result, but quickly estimating a close enough result after arbitrarily defining what you consider to be "close enough"="epsilon>0." ; "close enough" means; if (abs(a)<epsilon){printf "close enough"; break} else continue
printf "trying to get closer";
There is a HUGE temptation to set epsilon==0. which completely misses the point of abs(a) never reaching ==0. , just like Achilles turtle race, it takes too long to have the precision get too small to round a to exactly 0.
----
bloodnok
i am working on the pony shaders.
made a cherileee cutiemark
i want them to be better than meatballs, making things tricky.
bloodnok
for mixing different types of geometry, you want to do "deferred shading"
in general, deferred shading generates multiple fullscreenmaps (often at quater resolution, from different sources, and mixes them together in a final composition with a few min() and max() functions and usually some gaussian blur.
its advantage, besides mixing different sources, is to delay the trickier functions, like sqrt) as long as possibly, by only storing squared distances in the buffered images = dot(a,a), instead of length(a)= sqrt(dot(a,a));
the most recent unity-engine has multiple plugins for deferred shading, basically turning meshes into parametric signed distance functions, to be displayed and mixed as such in real time. sadly this is mostly Japanese and not localized well, because their motivation is to mix MMD meshes with raymarching geometry and effects (like fractals, cloth, clouds, erosion and thermodynamics)
---
7 pages late:
Tigrou
simple solution is, all rays have the same origin, the focal point of a camera lens. which is very arbitrary, best set to rayorigin=focalpoint=vec3(0);
and the ray direction = vec3(fragmentpos.xy,orthogonalDistancefScreenRayOrigin)-rayorigin;
this automatically turns the z.axis into the "distance to view screen" axis, using the same domain namespace as zNear and zFar.
ɧคɾɗվ
your cone disatance function must be wrong.
your error is likely; not using a good estimate for euclidean distance in you distance function.
length(a)=sqrt(dot(a,a))=pow(dot(a,a), 0.5),exp(log(a.x*a,x+a.x*a.x+a.z*a.z)*0.5);
//exp() and log() are inverses exponential functions of the same base, best baseE=base2.71... , base2 can be faster and more precise, but will constrain your maximum range a lot more.
manhattan-taxicap distance is faster than exp(log()) on some systems, but it is insufficient for raymarching/spheretracking
; so my guess is that your sqrt() is not precise enough.
ɧคɾɗվ
same answer i wrote to bloodnok above:
"deferred shading" and a fuckton of "uvw mapping" is your term. this is a common geography problem. making a plane that is warped in 3d space to a planar 2d plane, and its inverse function. the problem only gets trickier when you enter low earth orbit with high enough speeds to stay in orbit, you angles must be precise or you waste fuel, therefore nasa has all your solutions to good plane-to-sphere mapping (and its inverse). not kidding. earth is not spherical and its gravitational field is not uniform. this shit is tricky!
---
3 pages late
you are confusing BV(H) with quad-trees, and all the other hierarchical structures.
you do not even need the hiearchy part of a BVH, a single simple bounding volume for each distance field can easily double the frame-rate most times. hoiierarchy iterations get diminishing returns, up to 10x max performance:
https://www.shadertoy.com/view/XlVXDz
BV(H) is faster on older hardware, as it requires a lot of if() branches which are exponentially worse for newer hardware.
BV(H) is significantly faster on higher resolution (divided by number of total fragments), because you likely have more fragments that miss a BV(H) than fragments that hit a BV(H), increasing exponentially over distance, depending on your field of view.
BVH big shortcoming is that it is much more static, near impossible to recompute a complete BVH 60 times per seconds with dynamic assets, involving a complex vertex shader to do that job, unlikely worth it, unless the BVH is (almost completely) static.
tree-tessellation is faster on newer hardware.
tree-tessellation is better for somewhat uniform patches, like lakes, forests, settlements, that tent to be shaped like voronoi cells or birdoid clusters.
quad-tree-tessellation is more easily done dynamically.
-----
cupe should try what i think that omid is talking about:
A (voxel)traversal loop, that contains a raymarching loop.
this is not even deferred shading, its just that the context/wrapper for raytracing is a traversal loop.
while(tracing) everything inside of OTHER tiles is completely ignored, as if it does not exist. the distance field is cut off outside the tile you are marching inside of.
as proof of concept for the below
https://www.shadertoy.com/view/4sB3D1
and
https://www.shadertoy.com/view/XsXBDX
have a traversal-loop( with a Tracing-loop() inside of it).
it traces conic trees inside of non uniform space filling 2d tiles that have <2 trees per tile.
within a tile, it only traces a single cone, then (most likely) traverses on to the next tile (assuming no other break condition is met).
this is just a slightly different variant of a similar concept; marching within voxels.
march(){
for(){} //your raymarching loop in here, or raytrace a simple shape, whatever.
//the fun part is that the traverse() function below constrains size and complexity of this distance field or the number of things to trace.
either hits a surface (within the tile), or exits the tile-boundary or runs out of steps (within the tile)
}
traverse(){
for(){
//your traversal loop here
// can be simple 2d tessellation, hexagonal, voxel.
//can be a 3d honeycomb
//may or may not be quadtree or octee.
// is NOT a bounding volume HIERARCHY though, no reason, i just don't like them enough yet.
march();//this contains the march() loop above!!!
}
this outer traversal-loop either
runs out of traversal-steps
or goes over zFar (which is measured in voxels AND/OR distance to camera)
or has gotten the a return-code from march() that a surface has been hit.
or may have gotten return-code from that march() ran out of march-steps, in which case it will just continue marching where it stopped by +1 iteration trough the same tile, wasting -1 traversal-steps within the same tile.
}
ideas are cheap. implementing them is the tricky challenge where you meet your limits.
and coding is relatively easy, just takes time. the most tricky part is PR and marketing, because there you deal with a lot of irrational and unpredictable behavior, requiring improvisation.
---
2 pages late:
tompk "I guess the last useful post you are referring to is by ollj O_o"
i dont think my posts are THAT useful, but i assume no sarcasm.
oh my so many dumb typos on page 70.
and so many ideas of mine that i did not develop much, just archived.
---
Tigrou
yes, you can distribute bounding box computation between cpu and gpu, and let the cpu focus on what is within the box boundaries. its ideal deferred shading and it is quite famous among 101 of openCL computing.
this is tricky and bulky though, and in the long run, diminishing returns will increase your energy cost. this is useless for mobile devices, due to extra memory required for all the instructions. unless you do crowd computing with it, then go for openCL implementing just that. many shitty xboxes progressed on a lot of crowd computaton since 2003 this way.
in the long run, your bottleneck is the pipeline between gpu and cpu memory, to a point where VulkanAPI is your only option to do this in real time, where this bottleneck is mostly solved, or at least 50* wider open than in openGL.
yes, bounding boxes on the gpu increase fps (especially for older hardware) at a cost of lowering compile time and at a cost of making the code up to twice as long, to a point where your iteration count for matching/reflections/occlusion is limited by the incompetence of your compiler, insisting on unrolling all loops, running out of memory.
https://www.shadertoy.com/view/ls2BDc
shows how a single near-torus-shaped bounding volume increases fps a lot.
no, a gpu does not take both sides of an if() branch, unless you compute both outcomes before the branch explicitly.
What openGL does is pausing the computation (or skip any effects of it) on every fragment where a condition is FALSE until all fragments were a condition is true are computed, then all fragments that where true get paused (or they skip all eccects of the false-branch), waiting for the false-branched fragments to do their false-contitional-thing.
You have like 640*480 fragments but only like 16 GPU cores, so there is a queue to wait in anyways. an if()condition simply has fragments "skip in line" and you barely notice the difference and it may as well be "parallel" as if there was no skipping in line.
only AFTER the false-fragments have been processed, the fragments get updated. prior to that update, the dfdx() and dfdy() functions point at a fragment-state prior for the if-branch.
the above is part of synchronization, it is simpler in opengl, as you are not given much control over it, but VERY complicated in openCL and vulkanAPU, where more details MUST be made explicit.
fear my nonsense:
A single pass errosion can be emulated with some multi octave fbm, multiplied by a (tiled) voronoi. it is commonly used to do lava flowing around rocks or ice sheets. with simpler noise, but the right height-map noise and exponential functions with different exponents per octave may make some nice rivers out of that.
---
domain warping in raymarching is simple. twist() is the most common, easily found on shadertoy.com, to turn a cylinder into a helix. large lipschitz constants are almost guaranteed.
bending a cylinder or box around a point that is nowhere near the surface of the cylinder is also quite common.
all domain warps come down to a 3d to 3d transformation matrix that is very non-uniform.
it quickly ends up becoming a (kaleidoscopic) iterated function system fractal, with complex number transformations.
changing the ray direction during ray marching is NOT how do do it. because inverse-warping the set is equal to warping the ray.
what you do is, convert yout transformation into a mat3() and then do inverse(mat3())
----
8 pages late;
hornet, your break condition for the raymarching loop is heuristic, meaning, sphere tracking approaches a surface, but will unlikely ever hit it (or below it), it will however get very close to the surface relatively quickly.
heuristics is not about getting the EXACT result, but quickly estimating a close enough result after arbitrarily defining what you consider to be "close enough"="epsilon>0." ; "close enough" means; if (abs(a)<epsilon){printf "close enough"; break} else continue
printf "trying to get closer";
There is a HUGE temptation to set epsilon==0. which completely misses the point of abs(a) never reaching ==0. , just like Achilles turtle race, it takes too long to have the precision get too small to round a to exactly 0.
----
bloodnok
i am working on the pony shaders.
made a cherileee cutiemark
i want them to be better than meatballs, making things tricky.
bloodnok
for mixing different types of geometry, you want to do "deferred shading"
in general, deferred shading generates multiple fullscreenmaps (often at quater resolution, from different sources, and mixes them together in a final composition with a few min() and max() functions and usually some gaussian blur.
its advantage, besides mixing different sources, is to delay the trickier functions, like sqrt) as long as possibly, by only storing squared distances in the buffered images = dot(a,a), instead of length(a)= sqrt(dot(a,a));
the most recent unity-engine has multiple plugins for deferred shading, basically turning meshes into parametric signed distance functions, to be displayed and mixed as such in real time. sadly this is mostly Japanese and not localized well, because their motivation is to mix MMD meshes with raymarching geometry and effects (like fractals, cloth, clouds, erosion and thermodynamics)
---
7 pages late:
Tigrou
simple solution is, all rays have the same origin, the focal point of a camera lens. which is very arbitrary, best set to rayorigin=focalpoint=vec3(0);
and the ray direction = vec3(fragmentpos.xy,orthogonalDistancefScreenRayOrigin)-rayorigin;
this automatically turns the z.axis into the "distance to view screen" axis, using the same domain namespace as zNear and zFar.
ɧคɾɗվ
your cone disatance function must be wrong.
your error is likely; not using a good estimate for euclidean distance in you distance function.
length(a)=sqrt(dot(a,a))=pow(dot(a,a), 0.5),exp(log(a.x*a,x+a.x*a.x+a.z*a.z)*0.5);
//exp() and log() are inverses exponential functions of the same base, best baseE=base2.71... , base2 can be faster and more precise, but will constrain your maximum range a lot more.
manhattan-taxicap distance is faster than exp(log()) on some systems, but it is insufficient for raymarching/spheretracking
; so my guess is that your sqrt() is not precise enough.
ɧคɾɗվ
same answer i wrote to bloodnok above:
"deferred shading" and a fuckton of "uvw mapping" is your term. this is a common geography problem. making a plane that is warped in 3d space to a planar 2d plane, and its inverse function. the problem only gets trickier when you enter low earth orbit with high enough speeds to stay in orbit, you angles must be precise or you waste fuel, therefore nasa has all your solutions to good plane-to-sphere mapping (and its inverse). not kidding. earth is not spherical and its gravitational field is not uniform. this shit is tricky!
---
3 pages late
you are confusing BV(H) with quad-trees, and all the other hierarchical structures.
you do not even need the hiearchy part of a BVH, a single simple bounding volume for each distance field can easily double the frame-rate most times. hoiierarchy iterations get diminishing returns, up to 10x max performance:
https://www.shadertoy.com/view/XlVXDz
BV(H) is faster on older hardware, as it requires a lot of if() branches which are exponentially worse for newer hardware.
BV(H) is significantly faster on higher resolution (divided by number of total fragments), because you likely have more fragments that miss a BV(H) than fragments that hit a BV(H), increasing exponentially over distance, depending on your field of view.
BVH big shortcoming is that it is much more static, near impossible to recompute a complete BVH 60 times per seconds with dynamic assets, involving a complex vertex shader to do that job, unlikely worth it, unless the BVH is (almost completely) static.
tree-tessellation is faster on newer hardware.
tree-tessellation is better for somewhat uniform patches, like lakes, forests, settlements, that tent to be shaped like voronoi cells or birdoid clusters.
quad-tree-tessellation is more easily done dynamically.
-----
cupe should try what i think that omid is talking about:
A (voxel)traversal loop, that contains a raymarching loop.
this is not even deferred shading, its just that the context/wrapper for raytracing is a traversal loop.
while(tracing) everything inside of OTHER tiles is completely ignored, as if it does not exist. the distance field is cut off outside the tile you are marching inside of.
as proof of concept for the below
https://www.shadertoy.com/view/4sB3D1
and
https://www.shadertoy.com/view/XsXBDX
have a traversal-loop( with a Tracing-loop() inside of it).
it traces conic trees inside of non uniform space filling 2d tiles that have <2 trees per tile.
within a tile, it only traces a single cone, then (most likely) traverses on to the next tile (assuming no other break condition is met).
this is just a slightly different variant of a similar concept; marching within voxels.
march(){
for(){} //your raymarching loop in here, or raytrace a simple shape, whatever.
//the fun part is that the traverse() function below constrains size and complexity of this distance field or the number of things to trace.
either hits a surface (within the tile), or exits the tile-boundary or runs out of steps (within the tile)
}
traverse(){
for(){
//your traversal loop here
// can be simple 2d tessellation, hexagonal, voxel.
//can be a 3d honeycomb
//may or may not be quadtree or octee.
// is NOT a bounding volume HIERARCHY though, no reason, i just don't like them enough yet.
march();//this contains the march() loop above!!!
}
this outer traversal-loop either
runs out of traversal-steps
or goes over zFar (which is measured in voxels AND/OR distance to camera)
or has gotten the a return-code from march() that a surface has been hit.
or may have gotten return-code from that march() ran out of march-steps, in which case it will just continue marching where it stopped by +1 iteration trough the same tile, wasting -1 traversal-steps within the same tile.
}
ideas are cheap. implementing them is the tricky challenge where you meet your limits.
and coding is relatively easy, just takes time. the most tricky part is PR and marketing, because there you deal with a lot of irrational and unpredictable behavior, requiring improvisation.
---
2 pages late:
tompk "I guess the last useful post you are referring to is by ollj O_o"
i dont think my posts are THAT useful, but i assume no sarcasm.
oh my so many dumb typos on page 70.
and so many ideas of mine that i did not develop much, just archived.
---
Tigrou
yes, you can distribute bounding box computation between cpu and gpu, and let the cpu focus on what is within the box boundaries. its ideal deferred shading and it is quite famous among 101 of openCL computing.
this is tricky and bulky though, and in the long run, diminishing returns will increase your energy cost. this is useless for mobile devices, due to extra memory required for all the instructions. unless you do crowd computing with it, then go for openCL implementing just that. many shitty xboxes progressed on a lot of crowd computaton since 2003 this way.
in the long run, your bottleneck is the pipeline between gpu and cpu memory, to a point where VulkanAPI is your only option to do this in real time, where this bottleneck is mostly solved, or at least 50* wider open than in openGL.
yes, bounding boxes on the gpu increase fps (especially for older hardware) at a cost of lowering compile time and at a cost of making the code up to twice as long, to a point where your iteration count for matching/reflections/occlusion is limited by the incompetence of your compiler, insisting on unrolling all loops, running out of memory.
https://www.shadertoy.com/view/ls2BDc
shows how a single near-torus-shaped bounding volume increases fps a lot.
no, a gpu does not take both sides of an if() branch, unless you compute both outcomes before the branch explicitly.
What openGL does is pausing the computation (or skip any effects of it) on every fragment where a condition is FALSE until all fragments were a condition is true are computed, then all fragments that where true get paused (or they skip all eccects of the false-branch), waiting for the false-branched fragments to do their false-contitional-thing.
You have like 640*480 fragments but only like 16 GPU cores, so there is a queue to wait in anyways. an if()condition simply has fragments "skip in line" and you barely notice the difference and it may as well be "parallel" as if there was no skipping in line.
only AFTER the false-fragments have been processed, the fragments get updated. prior to that update, the dfdx() and dfdy() functions point at a fragment-state prior for the if-branch.
the above is part of synchronization, it is simpler in opengl, as you are not given much control over it, but VERY complicated in openCL and vulkanAPU, where more details MUST be made explicit.
Ollj:
Can you please elaborate more on this one?
I especially wonder how twist() comes down to a "very non-uniform" transformation matrix. Also, what do you mean by "it quickly ends up" KIFS.
I need a clear step-by-step explanation here.
Quote:
twist() is the most common, easily found on shadertoy.com, to turn a cylinder into a helix. large lipschitz constants are almost guaranteed.
all domain warps come down to a 3d to 3d transformation matrix that is very non-uniform.
it quickly ends up becoming a (kaleidoscopic) iterated function system fractal, with complex number transformations.
Can you please elaborate more on this one?
I especially wonder how twist() comes down to a "very non-uniform" transformation matrix. Also, what do you mean by "it quickly ends up" KIFS.
I need a clear step-by-step explanation here.
ollj, this skipping of computation happens (at least on nvidia hardware) exactly by setting a write mask after which all of the threads execute both branches, just with writes disabled for some of them. Unless all of the threads agree on the branch to take, in which case that's obviously the only one that gets executed.