[code] Particle Systems
category: general [glöplog]
OK, so I know that it is wise to use a SOA instead of an AOS. But has anybody an advice if its better to use normal arrays, std::list, or a selfmade allocator to store particle data (pos, vel, etc.)? Especially considering 1) the killing of random particles and 2) cache optimized (SIMD) routines for VectorArrays...
Any thoughts would be welcome!
Any thoughts would be welcome!
What is SOA or AOS?
arrays should be good enough. dunno what you mean by killing random particles since every particle should be somehow organized :) "kill" it when its life goes to zero?
+ i recommend SOA, although no idea what it means.
+ i recommend SOA, although no idea what it means.
Using std::list for particle systems is the worst decision you could make. Well, not the most, you could be using std::map.
array or std::vector.
and removing particles in the middle of an array is trivial, because particles don't have to be stored in any particular order (even when you zsort them or anything, you need to do that per frame anyway):
and removing particles in the middle of an array is trivial, because particles don't have to be stored in any particular order (even when you zsort them or anything, you need to do that per frame anyway):
Code:
for(i=0;i<count;)
if(!particle[i].expired) {
// process particle
i++;
}
else
particle[i] = particle[--count];
Actually you can have two pointers (indices), one for reading and one for writing. e.g.:
for(i = 0, j = 0; i < activeparticles; i++) {
if (!particle[i].expired) {
// read data for particle i and write to particle j
j++;
}
}
activeparticles = j;
it's more or less the same as above.
for(i = 0, j = 0; i < activeparticles; i++) {
if (!particle[i].expired) {
// read data for particle i and write to particle j
j++;
}
}
activeparticles = j;
it's more or less the same as above.
This may help.
i think soa can be poison for your cache... at least if you avoid dynamic allocation. same applies if you use some kind of own memory management like illustrated in ryg's lightspeed seminar. personally i prefere a normal well aligned aos since its easier to operate on single items (delte, reorder,...).
first, for the "uninitiated": aos=array of structures, soa=structure of arrays.
aos:
soa:
in practice, you usually do something like "aosoa": (or atleast that's what i tend to end up with)
the idea behind soa is to make data more SIMD-friendly - vector adds etc are no problem either way, but things like dot or cross products are not particularly fast when you want to do them between individual components of a single SIMD register (lots of data movement/swizzling involved). the "aosoa" variant groups data into "SIMD-friendly" chunks but has nicer locality in case you also do a bit of random access now and then.
don't think. measure.
as long as you perform sequential operation (...like in a particle system...), working in multiple streams is not a problem (in my experience anyway) - it doesn't make much of a difference whether you have 3 cachelines' worth of xyz or 1 of x,y,z each. nonsequential access is different - there you definitely want tight packing so you don't have to fetch more cachelines than necessary.
pan: matter of taste, mostly. i like my variant a bit more (obviously :)) because it never has to re-read cachelines that have already been processed, even when most items get deleted and the distance between the write and read pointers gets big (even when you write a whole cacheline, it first gets read from memory, unless the memory region is marked as write combined). having said that, i seriously doubt there's any significant difference outside of specially constructed scenarios :)
aos:
Code:
struct particle {
float x,y,z;
} particles[1000];
soa:
Code:
struct particles {
float x[1000],y[1000],z[1000];
};
in practice, you usually do something like "aosoa": (or atleast that's what i tend to end up with)
Code:
struct particle4 {
float x[4],y[4],z[4];
} particles[250];
the idea behind soa is to make data more SIMD-friendly - vector adds etc are no problem either way, but things like dot or cross products are not particularly fast when you want to do them between individual components of a single SIMD register (lots of data movement/swizzling involved). the "aosoa" variant groups data into "SIMD-friendly" chunks but has nicer locality in case you also do a bit of random access now and then.
Quote:
i think soa can be poison for your cache...
don't think. measure.
as long as you perform sequential operation (...like in a particle system...), working in multiple streams is not a problem (in my experience anyway) - it doesn't make much of a difference whether you have 3 cachelines' worth of xyz or 1 of x,y,z each. nonsequential access is different - there you definitely want tight packing so you don't have to fetch more cachelines than necessary.
pan: matter of taste, mostly. i like my variant a bit more (obviously :)) because it never has to re-read cachelines that have already been processed, even when most items get deleted and the distance between the write and read pointers gets big (even when you write a whole cacheline, it first gets read from memory, unless the memory region is marked as write combined). having said that, i seriously doubt there's any significant difference outside of specially constructed scenarios :)
the problem im refering to is that u may need more than 4 cachelines for a particle. think about the acceleration for each particle... your final soa may have something like 6 or more arrays with the same size. if you now choose a bad size they will share the same line set resulting in cache fighting. this is probably a quite uncommen scenario but not impossible at all.
its not about having 3 cache lines, its about possible fighting ones
its not about having 3 cache lines, its about possible fighting ones
what sort of thing do you want to do with particles, because there are many cases where you don't have to store or update anything at all.
ttl - the solution is very simple (and already visible in my short example code :): use counts which don't have high powers of two as divisors. easiest way to make sure you get a usable spread in cache usage.
sure... if sizes are well chosen theres no problem at all.
i just wanted to point out that soas have a really great potential to fuck up the loop ;)
but if you keep that in mind everything should be fine.
i just wanted to point out that soas have a really great potential to fuck up the loop ;)
but if you keep that in mind everything should be fine.
whoa, thanks for all the helpful replies! That cleared up a lot of questions.
I had the same idea in mind like the "swapRemove" in ryg's first post, when the particle has expired.
Will try stuff out when getting home..
I had the same idea in mind like the "swapRemove" in ryg's first post, when the particle has expired.
Will try stuff out when getting home..
ah, just another question: Is there any advantage in rendering particle billboards in screen-space (RHW) instead of using camera-aligned 3d-quads?
regarding SOA/AOS... ryg's idea....
is the way to go, but I've prefered to add a twist:
this way, if you are coding SIMD via automatic paralelisation (such as gcc-4 provides), then you are not tied to todays 4-way SIMD... if there is next processors that use 8-way, 16-way or 32-way SIMD, they will enable to use this automatically. I recall pondering this one when I first started to code a k6 machine with amd 2-way simd and just a bit later I was starting to make powerpc 4-way simd... I noticed that adjusting too much to the machine in this case was not good ;)
Code:
struct particle4 {
float x[4],y[4],z[4];
} particles[250];
is the way to go, but I've prefered to add a twist:
Code:
struct particle4 {
float x[32],y[32],z[32];
} particles[1024/32];
this way, if you are coding SIMD via automatic paralelisation (such as gcc-4 provides), then you are not tied to todays 4-way SIMD... if there is next processors that use 8-way, 16-way or 32-way SIMD, they will enable to use this automatically. I recall pondering this one when I first started to code a k6 machine with amd 2-way simd and just a bit later I was starting to make powerpc 4-way simd... I noticed that adjusting too much to the machine in this case was not good ;)
i just make it one particle per structure and make use of the 4 simd float ops intelligently (e.g. put the rotation in the position.w, put the angular velocity in the velocity.w).
although that would be on ps2/psp, cant remember the last time i did a particle effect on pc :)
although that would be on ps2/psp, cant remember the last time i did a particle effect on pc :)
for "normal" particle motions, that's just fine (and also what we use internally in our particle systems - which aren't even SSE optimized, because they're fast enough).
and in fact, what i said previously wasn't correct: you can do dot products pretty efficiently even in an AOS layout, by always computing N (where N is the vector size of whatever instruction set you're using) dot products at once. dot products with a constant vector are easy, dot products between variable vectors are a bit more involved because you have to perform a matrix transpose operation between SIMD registers, but it's still relatively fast.
and in fact, what i said previously wasn't correct: you can do dot products pretty efficiently even in an AOS layout, by always computing N (where N is the vector size of whatever instruction set you're using) dot products at once. dot products with a constant vector are easy, dot products between variable vectors are a bit more involved because you have to perform a matrix transpose operation between SIMD registers, but it's still relatively fast.
i'd also advise implementing the math primitives you use to generate simd (using for ex. the sse intrinsics on pc) instead of optimizing the separate case(s), costs a little more time, yields speed anywhere you do heavy vectorized floating point processing
And nowadays I might not agree with myself that much ;)
sorry for the kick!
sorry for the kick!
superplek, it's no problem - no-one listens to you anyway! hohoho!
sitting with something you'd like to share, Conan? :)
i agree with Ryg, measure... from my experience there is no "final" solution for particle systems. i tend to measure and try as i go.