Custom Virtual Machines in Intros
category: code [glöplog]
Greetings!
Does anyone have experiences about implementing demos with custom bytecode? Atleast Rose Shank uses this approach succesfully in a 4k and XMunkki even finds that the "the final crunch was more adding features than churning down the size"
I'm working on an embedded DSL for graphics/demos, something in the spirit of GPipe but with a more Renderman than OpenGL inspired approach. The intermediate language is based on lambda calculus with de Bruijn indices and is describes the visuals without implementation details. The idea is to have same program to be compileable to multiple backends (CPU, OpenGL, WebGL, cluster...) by changing only the interpreter. One backend could be for tiny intros.
OpenGL based implementation would essentially be a string generator with the following assumed advantages:
The interpreter might turn out to be complex (for 4k), but I'm hoping to implement something like this in time for next Instanssi. Any feedback about this topic is greatly appreciated, as are general ideas and discussion about virtual machines in demos.
Thanks!
Does anyone have experiences about implementing demos with custom bytecode? Atleast Rose Shank uses this approach succesfully in a 4k and XMunkki even finds that the "the final crunch was more adding features than churning down the size"
I'm working on an embedded DSL for graphics/demos, something in the spirit of GPipe but with a more Renderman than OpenGL inspired approach. The intermediate language is based on lambda calculus with de Bruijn indices and is describes the visuals without implementation details. The idea is to have same program to be compileable to multiple backends (CPU, OpenGL, WebGL, cluster...) by changing only the interpreter. One backend could be for tiny intros.
OpenGL based implementation would essentially be a string generator with the following assumed advantages:
- The IR-representation should be compact due to it's high level (and high order) nature.
- OpenGL could be treated as a general purpose compute language (with imageLoad/Store), allowing custom rasterization for example.
- Easy coding, map (* 0.5) img compiles to pixel shader dimming the image. Multipass effects are function compositions.
- Free optimizations as the graphics driver essentilly is a compiler.
- Necessary shader string generator might also possibly be used to compress external symbol imports
The interpreter might turn out to be complex (for 4k), but I'm hoping to implement something like this in time for next Instanssi. Any feedback about this topic is greatly appreciated, as are general ideas and discussion about virtual machines in demos.
Thanks!
didn't the mercury guys do something like this at some point?
I believe Blueberry's amiga 4ks works that way (the loonies ones).
I did this as well in my first days of 4k coding (and yes, cheers to las of mercury for the idea). I had an array of function pointers, an array of integers representing the amount of float arguments each function requires, and each opcode was a function accepting a float* as an argument. I had a small loop for: 1) reading the next opcode from the opcode array, 2) calling the opcode's function with the float array pointer as argument, 3) incrementing the opcode pointer and 4) advancing the float array pointer by the number of parameters the opcode needed. My intro data was correspondingly divided in opcodes (unsigned chars) and opcode data (signed chars, converted to float and scaled correspondingly) for better compression.
I had following opcodes(argument amount), amongst others:
- Matrix functions:cameraPosition(3), push(0), popMatrix(0), scale(3), translate(3), rotate(3);
- Drawing functions: drawMesh(1)
- Flow control: jump(1), jumpGreaterEqual(2)
- Arithmetic: add(2), sub(2), mul(2), div(2)
My intro worked with 255 floating point registers, I used a special escape byte (255) in the floating point data for accessing the register's value: once the interpreter encountered a 255 in the data array, it read the next position and interpreted it as a register index, then used the register's content as the required opcode argument.
I hope this brief summary was helpful.
I had following opcodes(argument amount), amongst others:
- Matrix functions:cameraPosition(3), push(0), popMatrix(0), scale(3), translate(3), rotate(3);
- Drawing functions: drawMesh(1)
- Flow control: jump(1), jumpGreaterEqual(2)
- Arithmetic: add(2), sub(2), mul(2), div(2)
My intro worked with 255 floating point registers, I used a special escape byte (255) in the floating point data for accessing the register's value: once the interpreter encountered a 255 in the data array, it read the next position and interpreted it as a register index, then used the register's content as the required opcode argument.
I hope this brief summary was helpful.
There's a talk from Blueberry where he describes a VM like that.
xTr1m: If your single opcodes are short you're probably better off having a chain of "loop" instructions to jump between the handlers (with a jmp to the starting point of the loop at the end), and advance the data pointer manually. Both things take less bytes in x86 asm and pack quite well. :)
xTr1m: If your single opcodes are short you're probably better off having a chain of "loop" instructions to jump between the handlers (with a jmp to the starting point of the loop at the end), and advance the data pointer manually. Both things take less bytes in x86 asm and pack quite well. :)
Make the lenght of those opcodes inversely proportional to their use in a program.
And never forget that stack machines are superior to register machines when it comes to code density!
kb_: Thanks for pointing Blueberry's talk, fascinating to know that they use a bytecode interpreter as a common part of their workflow instead of just a proof of concept.
xTr1m: Pleasantly detailed post, cheers. What horror did you find to select some other technique in your later work?
The opcodes seem well suited for general purpose graphics computations, however I was hoping to get along with fewer general purpose opcodes. As only program flow related code is executed on the CPU and math is lifted behind the shader compiler, the form for math instructions for example would be in the lines of:
- unaryFun f x
- unaryOp f x
- binaryFun f x y
- binaryOp f x y
where f denotes GLSL function name as string constant. This is just talk though, I have absolutely no idea what the end result will require (but will post results as they appear).
As for viznut's reminder of stack machines, the lambda calculus terms in de Bruijn form already encode variables by their recursion depth and this should be actually easier to implement than a register machine.
Perhaps it is time to actually code something. Thanks for the discussion so far!
xTr1m: Pleasantly detailed post, cheers. What horror did you find to select some other technique in your later work?
The opcodes seem well suited for general purpose graphics computations, however I was hoping to get along with fewer general purpose opcodes. As only program flow related code is executed on the CPU and math is lifted behind the shader compiler, the form for math instructions for example would be in the lines of:
- unaryFun f x
- unaryOp f x
- binaryFun f x y
- binaryOp f x y
where f denotes GLSL function name as string constant. This is just talk though, I have absolutely no idea what the end result will require (but will post results as they appear).
As for viznut's reminder of stack machines, the lambda calculus terms in de Bruijn form already encode variables by their recursion depth and this should be actually easier to implement than a register machine.
Perhaps it is time to actually code something. Thanks for the discussion so far!
There have been experiments using the forth language/virtual machine for 4k intros already:
Worth reading for some inspiration:
http://neoscientists.org/~plex/win4k/index.html
Worth reading for some inspiration:
http://neoscientists.org/~plex/win4k/index.html
We used a very simple bytecode using a autogenerated calltable - searching "ret" is cheap, but dangerous. Each bytecode incremented the "program counter" itself - using esi and lodsb...
Something like - (by searching from _vmcore for 0xC3 (ret) using lodsb it's simple to build a calltable))
Our machine had 256 float registers - some of them special purpose (e.g. time or some registers directly setting shader parameters).
Some ops:
call/ret, loop, mov(reg/reg, reg/immediate(0..255)), push/pop, swizzle, math ops, matrix stack ops, mesh placement ops...
Something like - (by searching from _vmcore for 0xC3 (ret) using lodsb it's simple to build a calltable))
Code:
; vm core loop - opcode 0 = exit
_vmcore:
xor eax,eax
lodsb
dec eax
js _vmhalt
call dword ptr [vmcalltable+eax*4]
jmp _vmcore
_vmhalt:
ret
_vmop0:
lodsb ; reg0 index
[...]
lodsb ; reg1 index
[...]
ret
_vmop1:
lodsb ; reg0 index
[...]
ret
Our machine had 256 float registers - some of them special purpose (e.g. time or some registers directly setting shader parameters).
Some ops:
call/ret, loop, mov(reg/reg, reg/immediate(0..255)), push/pop, swizzle, math ops, matrix stack ops, mesh placement ops...