Size limited coding vs new shader models (Is ray-tracing even possible in 64k or below?)
category: code
It's the end of 2020, kind of a rough year, and I have some time on my hands so I thought I'd relax by starting work on a new 64k engine - after 10 years on the current one it's about time.
Naturally I started looking into dx12 to be able to utilize all that new hardware has to offer, but ran into a big BUT, in the form of shader compilation.
Turns out that as of shader model 6.0(ish), the shader compiler is not actually part of the directx runtime and is required to be distributed along with applications that want to do their own shader compilation. We can use the old compiler to build stuff up to shader model 5.1, but that doesn't support ray tracing or mesh shaders or any of the new stuff.
A cursory glance at Vulkan showed the same problem: apparently you either supply the shader compiler yourself or you supply precompiled shaders.
The compilers obviously are way above the common size limits.
I'm wondering what the stance of the scene is on this issue.
Am I missing something?
Will we allow the shader compilers to still be present on the compo machines (as we did when the directx redist still contained them)?
Or will we have to skip out on newer API features in size limited applications (unless some way of including binary shaders at reasonable compression rates is found)?
What do you guys think?
Some thoughts (with some big grains of salt - I haven't dug into this very concretely yet, but I've been thinking about it tangentially and I don't think you're missing anything; I had the same overall impression):
I think an engine designed with today's hardware in mind needs to handle compositions of objects/scenes/effects that have non-local effects on logic on both CPU and GPU sides, and a traditional OOP approach doesn't really work anymore. It's difficult to predict how these features should be composed and I think it's unlikely that they'll be composed in the same way for each effect, and certainly not for each demo! So I think one has to be quite careful when designing an "engine"/abstraction over the hw so as not to box oneself in. At the same time, shader compilation seems to be something we have to deal with in order to not add yet another dependency besides OS/drivers. I think both of these problems can/should be addressed simultaneously at the engine architecture/tooling level.
More concretely in our case, I've been looking into generating CPU/GPU code with a compiler in the tool (that's project-defined) so that we have one representation that works well for authorship/composing things where we can extend/adjust the scene graph nodes and representations thereof, but just as importantly, allows us to have a totally separate low-level runtime code/representation. Obviously this is nothing novel (perhaps a bit space cadet admittedly because I like working with compilers), but I think it's one of many possible approaches. Specifically related to shader compilation is that we have a choice to make with this approach: would we ship the generated shader code (perhaps compiled to binary) or would it make more sense to take advantage of this approach and see if we can compress things better at a higher semantic level, and ship a very basic low-level shader JIT as part of the intro runtime (similar to what we do with x86 code today *ignores x64-shaped elephant in the room*)? The best thing would be to try both (ofc it's a spectrum and not just two ends as well) and compare but free time is limited (and getting more and more so as years go by) so we'll likely pick one and run with it (and I'm not sure which way I'm leaning yet).
I'm not that familiar with SPIR-V compression approaches (nor am I very familiar with the format itself yet) but it seems to be a bit bloaty and I'm guessing a different representation + decoder would likely pack better than trying to build models that perform well on SPIR-V data.
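To make the "different representation + decoder" idea a bit more concrete, here is a minimal sketch of one possible reversible transform. It assumes only the physical layout of SPIR-V (a 5-word header, then instructions whose first word holds the word count in the high 16 bits and the opcode in the low 16 bits). The function names and the hand-built demo module are made up for illustration, and whether split streams actually pack better than raw SPIR-V would have to be measured.

```python
import struct

SPIRV_MAGIC = 0x07230203  # first word of every SPIR-V module
HEADER_WORDS = 5          # magic, version, generator, bound, schema


def split_streams(module):
    """Split a little-endian SPIR-V module into header, first-word and operand streams."""
    if len(module) % 4 or len(module) < 4 * HEADER_WORDS:
        raise ValueError("not a whole SPIR-V module")
    words = list(struct.unpack("<%dI" % (len(module) // 4), module))
    if words[0] != SPIRV_MAGIC:
        raise ValueError("not a little-endian SPIR-V module")
    header, firsts, operands = words[:HEADER_WORDS], [], []
    i = HEADER_WORDS
    while i < len(words):
        count = words[i] >> 16  # high 16 bits: instruction length in words
        if count == 0 or i + count > len(words):
            raise ValueError("malformed instruction")
        firsts.append(words[i])  # keep the whole first word so the split is reversible
        operands.extend(words[i + 1:i + count])
        i += count
    return header, firsts, operands


def join_streams(header, firsts, operands):
    """Inverse of split_streams: interleave the streams back into a module."""
    words, pos = list(header), 0
    for first in firsts:
        n = (first >> 16) - 1
        words.append(first)
        words.extend(operands[pos:pos + n])
        pos += n
    return struct.pack("<%dI" % len(words), *words)


def demo_module():
    """Hand-built two-instruction module (illustrative only, not validated)."""
    words = [SPIRV_MAGIC, 0x00010000, 0, 1, 0,
             (2 << 16) | 17, 1,      # OpCapability Shader
             (3 << 16) | 14, 0, 1]   # OpMemoryModel Logical GLSL450
    return struct.pack("<%dI" % len(words), *words)
```

The separated streams could then be handed to whatever context model the packer uses; the intro runtime would only need `join_streams` (or its native equivalent) before creating the shader module.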
Meanwhile, Metal supports raytracing, the shader compiler is built into the OS, and the API is so much more nicely designed I’m able to do 4k exegfx without an exe packer.
It’s like 30 lines of code or something to set up a fullscreen window, a compute function and a render loop. The only thing missing really is a crinkler equivalent.
I have worked extensively on shader compilers targeting both DXIL and SPIR-V, and I would say that emitting SPIR-V from a high-level IR isn't too much work. The representation is relatively straightforward, and has a lot of neat details that let you do simple "half compilers" without too much hassle.
Sadly, the situation is quite different for DXIL, where you need both very complicated LLVM 3.7 bitcode emission and signing of the shaders post-compile (which requires either DXIL.dll, which currently doesn't ship with Windows in a well-defined location, or reverse engineering the proprietary digest). DXIL also isn't very compressor-friendly, and uses a ton of variable-length bitfields. And you can't use DXBC for Shader Model 6.
I guess the TLDR here is that I think it's viable to use Vulkan for 64ks without shipping a (large) compiler somehow. But I think it's less so for D3D12. It's a bit sad.
If you're interested, I can share more information. I'm probably the only person who has successfully written a DXIL-targeting shader compiler outside of Microsoft's DXC project (which is actually an LLVM fork, so they didn't even need to emit the LLVM bitcode). It's doable, but I wouldn't have started from scratch if I didn't have to.
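For readers who haven't met the "variable-length bitfields" mentioned above: LLVM bitcode stores many integers as VBR fields, where each chunk of n bits carries n-1 payload bits plus a continuation flag in its top bit. The sketch below only illustrates that primitive; it does not model the DXIL container or any real record layout, and the helper names are made up.

```python
def vbr_chunks(value, width):
    """Encode a non-negative integer as LLVM-style VBR chunks of `width` bits each."""
    if value < 0 or width < 2:
        raise ValueError("need value >= 0 and width >= 2")
    payload_bits = width - 1
    cont = 1 << payload_bits  # top bit of a chunk: more chunks follow
    mask = cont - 1
    chunks = []
    while True:
        low = value & mask
        value >>= payload_bits
        if value:
            chunks.append(low | cont)
        else:
            chunks.append(low)
            return chunks


def pack_bits(fields):
    """Pack (value, bit_width) pairs LSB-first into bytes, the way a bitstream would."""
    acc = nbits = 0
    for value, width in fields:
        acc |= value << nbits
        nbits += width
    return acc.to_bytes((nbits + 7) // 8, "little")
```

For example, 30 as a vbr4 value becomes the chunks `0b1110` and `0b0011`, which pack into the single byte 62. Because fields like this are not byte-aligned, the same value can land at different bit offsets each time it appears, which is one plausible reason byte-oriented packers struggle with this kind of data.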
optimism:
just wait 10 more years until WebGPU adds support for a hardware raytracing API
realism:
keep writing GLSL 1.1 shaders for the next 10 years
Quote:
64ks without shipping a (large) compiler
I haven't looked into the latest graphics APIs, so forgive me if this is a stupid question, but why is it necessary to ship a compiler at all? Is the compiled representation too bloated for 64k?
> but why is it necessary to ship a compiler at all
I have no idea either.
Including precompiled SPIR-V will always be smaller than including a whole GLSL compiler plus the GLSL shaders.
Only tools like a shader editor have to include the compiler, but for a tool size does not matter... so idk
Quote:
>include precompiled spir-v always be smaller than including whole GLSL compiler and glsl shaders.
Actually no - text shaders are usually smaller (especially as you increase complexity) and are low-entropy data (since they're text-only, especially minified) so they compress better, plus you can do stuff like stitching to get more mileage out of them, which you couldn't do with binary.
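A rough sketch of what "stitching" can look like, assuming shaders are assembled at runtime from a shared pool of source fragments. The fragments, shader names and the one-byte-per-index cost are all hypothetical; a real intro would hold its own minified GLSL here.

```python
# Hypothetical shared fragments of minified GLSL.
FRAGMENTS = [
    "float sd(vec3 p){return length(p)-1.;}",
    "vec3 sky(vec3 d){return vec3(.5+.5*d.y);}",
    "void main(){gl_FragColor=vec4(sky(vec3(0,1,0)),sd(vec3(0)));}",
    "void main(){gl_FragColor=vec4(sd(vec3(2)));}",
]

# Each shader is stored as a short list of fragment indices.
SHADERS = {
    "scene_a": [0, 1, 2],
    "scene_b": [0, 3],
}


def stitch(name):
    """Build the full source text for one shader from the shared fragments."""
    return "".join(FRAGMENTS[i] for i in SHADERS[name])


def stored_bytes():
    """Bytes stored with stitching (assuming one byte per index) vs. full copies."""
    stitched = sum(len(f) for f in FRAGMENTS) + sum(len(ix) for ix in SHADERS.values())
    full = sum(len(stitch(n)) for n in SHADERS)
    return stitched, full
```

Every fragment used by more than one shader is stored once, which is the part that has no straightforward equivalent once each shader is a separate compiled binary.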
Quote:
Actually no
Actually yes.
All my demos on pouet use my own Vulkan launcher, which is 26Kb (exe only, 27Kb for the Linux build) and is also open source.
That launcher runs my shaders, which are up to 1Mb as compiled SPIR-V,
and to avoid shipping 1Mb of SPIR-V I compress it from 500Kb-1Mb down to 20-50Kb,
so the total size of the exe becomes 26+50=76Kb.
My largest shader is here: https://www.pouet.net/prod.php?which=84806 (literally almost 1Mb of compiled SPIR-V compressed to 50Kb).
The second biggest is https://www.pouet.net/prod.php?which=85052 (~500Kb of SPIR-V compressed to 30Kb).
Every tool I use is open source, and so is my own launcher code.
Quote:
text shaders are usually smaller (especially as you increase complexity) and are low-entropy data
The GLSL source code of the shader is ~50Kb (as text).
Zip compresses this shader to 12Kb.
The compressed SPIR-V of that shader is 50Kb (zip not used, and zip does not make those 50Kb any smaller).
Yes, even my compressed SPIR-V is ~5x bigger than the source code of the shader.
But this is Vulkan, and we have to accept it.
Including a compiler in a GLSL Vulkan demo to make it "smaller" only makes sense when the demo uses A LOT of shaders (like 50+ shaders with unique text)... for a small demo that uses fewer than 10 shaders, like mine, a compiler does not make it smaller.
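The break-even argument above can be written down as a small calculation. The per-shader figures (12Kb zipped source, 50Kb compressed SPIR-V, 26Kb launcher) come from the post and describe one unusually large shader; the 500Kb compiler size is a made-up placeholder, not a measurement of any real compiler.

```python
import math


def total_kb(n_shaders, per_shader_kb, fixed_kb=0.0):
    """Total payload: a fixed part plus n shaders of equal size."""
    return fixed_kb + n_shaders * per_shader_kb


def break_even_shaders(compiler_kb, source_kb, binary_kb):
    """Smallest shader count at which compiler + sources is strictly smaller than binaries."""
    saving = binary_kb - source_kb  # Kb saved per shader by shipping source
    if saving <= 0:
        return None  # source is not smaller, so the compiler never pays off
    return math.floor(compiler_kb / saving) + 1


# Hypothetical 500Kb embedded compiler, figures from the post for the rest.
N = break_even_shaders(compiler_kb=500, source_kb=12, binary_kb=50)
```

With those inputs `N` is 14, so the conclusion depends heavily on how small a compiler one can actually ship and on how large the individual shaders are.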
Just tested this with Metal, using my path tracer shader (source) for my next exegfx:
Compiled: 26kb
Compiled & zipped: 18kb
Source: 25kb
Minified source: 8kb
Zipped minified source: 3.7kb (the executable ends up under 4KB)
If you want to size code these things, compiling the shaders at runtime is a massive win and any platform / API that doesn't include the compiler is going to be at a big disadvantage.
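Putting the figures above side by side as ratios, plus a small helper for repeating the measurement on other shaders. The sizes are the ones quoted in this post; `packed_size` uses zlib only as a stand-in for "zipped" and is an assumption, not the tool that produced those numbers.

```python
import zlib

# Sizes in kb as quoted above for the Metal path tracer shader.
SIZES_KB = {
    "compiled": 26.0,
    "compiled_zipped": 18.0,
    "source": 25.0,
    "minified": 8.0,
    "minified_zipped": 3.7,
}


def ratio(a, b):
    """Size of entry `a` relative to entry `b`."""
    return SIZES_KB[a] / SIZES_KB[b]


def packed_size(data):
    """Deflate-compressed size in bytes (zlib level 9) of a bytes object."""
    return len(zlib.compress(data, 9))
```

On the quoted numbers, zipping keeps about 69% of the compiled binary but only about 46% of the minified source, and the zipped binary ends up roughly 4.9 times the size of the zipped minified source.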
Quote:
which either require DXIL.dll which currently don't ship with windows in a well defined location, or reverse engineering the proprietary digest
I don't know how bad the bitcode emission is, but using a reverse engineered digest probably isn't much worse than the hacks already used by PE compressors.
Quote:
DXIL, where you both need very complicated LLVM 3.7 bitcode emission as well as signing the shaders post-compile (which either require DXIL.dll which currently don't ship with windows in a well defined location, or reverse engineering the proprietary digest)
Would it be a problem to let DXIL.dll sign your minified DXIL code? Of course you would then have to store the (incompressible) signature itself; I don't know how big that is (or how many you really need, since not everything has to be SM6). And then I have no idea how much we can get out of minifying+transforming DXIL.
It's unfortunately hard to see which default installed program in windows would need the compiler, apart from Edge in 10 years as Danilw hints ;)
The problem seems to be that DXIL doesn't compress well, and that it would be better to generate it at runtime from source code or a more suitable bytecode representation. If you just distribute the compiled and signed DXIL, you don't need a compiler or signing at runtime, but it takes more space than distributing minified source code like people are used to with older graphics APIs.
absence: yes, but you can still compress it a lot better than by just throwing it at your default execruncher. You would have 2 layers of transformation: an outer one still producing valid (but minified/regularized) DXIL (needing signing) and an inner one that needs to be detransformed at runtime. And THEN you would throw the transformed version ("more suitable representation") at the execruncher.
But yes, it looks quite bad at a glance https://github.com/microsoft/DirectXShaderCompiler/blob/master/docs/DXIL.rst#introduction
Wonder how much we can hook into the pipeline. I.e. compiling SM5 shaders from HLSL and then patching up the resulting DXIL... probably doable with the new intrinsics, and less so with the new shader types.
Quote:
I guess the TLDR here is that I think it's viable to use Vulkan for 64ks without shipping a (large) compiler somehow. But I think it's less so for D3D12. It's a bit sad.
I haven't looked at the new API yet, but my intuition[1] would be:
- High-level shader source tends to compress well, especially so considering the cases when one needs to generate many shader variants.
- A minimal shader compiler might be something doable for a reasonably small binary footprint, and we might even see one emerge just like we have synthesizers.
- Since we already have "static" shader source transformations like minifying, we could have other transformations to do the heavy lifting so even a naive compiler would generate decent shader code.
[1]: My experience with size-coding also tells me that intuition isn't a good indicator. :)
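As an illustration of the "static" source transformations mentioned in the list above, here is a deliberately naive minifier sketch. It assumes GLSL-like source without string literals, keeps preprocessor directives on their own lines, and leaves spaces around `+` and `-` alone so that something like `a - -b` is never fused into `a--b`. Real minifiers also rename identifiers and do much more.

```python
import re


def _squeeze(code):
    """Collapse whitespace and drop spaces next to punctuation that never needs them."""
    code = re.sub(r"\s+", " ", code).strip()
    return re.sub(r" ?([{}()\[\];,=*/<>!&|?:]) ?", r"\1", code)


def minify(src):
    """Naive shader minifier: strips comments and redundant whitespace."""
    src = re.sub(r"/\*.*?\*/", " ", src, flags=re.S)  # block comments
    src = re.sub(r"//[^\n]*", "", src)                # line comments
    parts, code = [], []

    def flush():
        if code:
            parts.append(_squeeze(" ".join(code)))
            code.clear()

    for line in src.splitlines():
        line = line.strip()
        if line.startswith("#"):
            flush()  # directives must stay on a line of their own
            parts.append(re.sub(r"\s+", " ", line))
        elif line:
            code.append(line)
    flush()
    return "\n".join(parts)
```

A more aggressive offline transformation could go further, since the point made above is that the heavy lifting can happen at build time and leave little work for whatever runs inside the intro.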
Zavie:
I already posted my links to an actual 26Kb exe Vulkan shader launcher (not including shader size) (that has no validation errors and is cross-platform).
If you're interested you can check it.
It seems a bit pointless to compare a 'shader launcher' alone, since the shader size is often the real problem when it comes to size code. Especially when anything competitive needs to be capable of <4KB with the 'shader launcher', shader(s) and music. And 26kb for just the base code, no shaders, isn't going to be competitive in the 64k compo...
at least we now have a good excuse that API innovation is killing the demoscene... (wait, didn't we say this before?) :D
alia, if you're talking about "that 4kb" OpenGL shader launcher code that works only on Nvidia and depends heavily on WinAPI... I think that is a perfect example of a broken piece of software
(that 4kb launcher doesn't work under Wine, for example, and every new Windows (10+) version will break the demos that use that broken code)
As I said, my launcher code does "not have any validation errors and is cross-platform".
A "minimal broken Vulkan shader launcher for WinAPI" is also about 4Kb (exe), but it also works only on Nvidia (the source is less than 100 lines of code).
I can link it if you want; the code is not mine but it is open source (and it is "bad" to use this code... it literally works only on Nvidia, with a few driver version exceptions).
god you're even worse than stefan! probably with half his skills even! :D unless your minimal broken compiles a string of shader source at runtime without pre-SPIRVing it and also excluding the (bloated) glslang/shaderc magic which would render (hihi, pun intended) doing a 4K pretty bloody useless
+launcher somewhere
That, who gives a shit about launchers? We’re here to make demos. (And I’ll have a pair of macOS 4K exegfx out tomorrow, 4kb including the launcher and shader, imagine that! Although not using the ray tracing APIs... I think it’s possible in 4kb)
Who needs RTX at all?
Haven't seen a 4K using any vertex shader (models made from polygons/vertices) for many years! ;)
For 64Ks this would maybe be something nice to have, but then: if we can raytrace in 4K, where would be the problem of doing so in 64K?
(I am obviously not talking about raymarching (sphere marching) here, but more of something like this from 2012: Piscine Parfait by Archee)
Wouldn't using RTX for raytracing be just as lame as using DrawSphere() from DirectX/OpenGL?!
Bad enough we need to use these APIs to get access to some reasonable drawing surface/shaders in the first place!
OK, maybe I shouldn't be too mad about this, as shaders rule; without them we would still be doing the same stuff as in 2000, I guess! (But even with 3GHz my last CPU raymarcher was very slow compared to shader marchers! Or was it 12GHz, if it was a quad-core CPU and the marcher used multithreading? Nope, I don't think so! But there should be even more power in the same time!)