CSDb comments by Algorithm (Naveed Khugiani)
- demo Commodore 64 AWE - Algo Wave Extraction by Algotech
- Some "quick" technical notes behind this demo
Let's start off with the encoder.
The encoder analyses the complexity of the audio in variable chunks of 20/40ms, and this determines the chunk length to analyse. This is also useful for preserving transients.
The encoding process extracts the following data from each chunk:
frequency
This is 10 bits, plus an additional bit to determine whether the frequency will traverse backwards. This can get better matches for asymmetrical waveforms.
phase
a range of 0-255 which is the start offset into the waveform
volume
a range of 0-31. This implies 5-bit resolution; however, on inspecting the maximum amplitudes, many would not reach the highest peaks. For this reason, the encoding is done at 7-bit volume resolution per layer (0-127), but only the first quarter of this range is used, which then fits into 5 bits.
waveform
a range of 0-7. any one of 8 waveforms (noise, sine, triangle, pulse, sawtooth, resonance1, resonance2, trisaw)
It does all of the above for a number of layers. In this demo I opted for 8 layers (channels).
Compared with Fourier-transform-based encoding, it requires only a quarter of the layers to achieve the same quality.
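As a rough illustration of the per-chunk parameter search described above, here is a minimal Python sketch. All table contents, search ranges and names here are my own illustrative assumptions, not the actual encoder:

```python
import math

# Illustrative waveform tables (the demo's eight include noise, pulse,
# resonance variants etc.; only two are sketched here).
WAVEFORMS = {
    "sine": [math.sin(2 * math.pi * i / 256) for i in range(256)],
    "triangle": [1 - abs((i / 64.0) % 4 - 2) for i in range(256)],
}

def render(wave, freq_step, phase, volume, reverse, n):
    """Render n samples of one layer from its extracted parameters."""
    table = WAVEFORMS[wave]
    step = -freq_step if reverse else freq_step   # the extra "backwards" bit
    return [volume * table[(phase + i * step) % 256] for i in range(n)]

def encode_chunk(chunk):
    """Brute-force the best (waveform, freq, phase, volume, reverse) for one layer."""
    best, best_err = None, float("inf")
    for wave in WAVEFORMS:
        for freq_step in range(1, 32):        # coarse; the real encoder has 10 bits
            for reverse in (False, True):
                for phase in range(0, 256, 16):
                    for volume in range(0, 32, 4):
                        cand = render(wave, freq_step, phase, volume, reverse, len(chunk))
                        err = sum((a - b) ** 2 for a, b in zip(chunk, cand))
                        if err < best_err:
                            best, best_err = (wave, freq_step, phase, volume, reverse), err
    return best
```

A full multi-layer encoder would then subtract the best-matching layer from the chunk and repeat the search for each remaining layer.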
Now onto the decoder
The lower 3 bits of the 11-bit frequency point to the fine-tuned pitch tables; the upper 8 bits are the main step values. These are incremented together to form 8 step values, which are written into the main decode loop per layer (to the waveform low bytes).
The phase is merely written to the index per layer.
The waveform value is calculated by pointing into the 512-byte waveform table (256 bytes repeated twice) to allow the indexing to cross over. The exception is the noise waveform.
The volume data points to the relevant scaled volume table (I did previously experiment with signed volume tables, but this introduced unwanted noise due to the carry flag being cleared/set on crossover).
The decoder has 8 phase accumulators, one per layer. The main optimisation is that the phase is only updated once per 8 bytes, as the step values are already calculated per 8 bytes. At the end of the 8 bytes, the last step value is written to the index, which allows it to loop seamlessly at the required frequency.
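The per-8-bytes phase-accumulator trick can be sketched like this in Python (structure illustrative only; the real decoder is unrolled 6502 code with step values written into the loop itself):

```python
def decode_frame(layers, n_samples, waveform_tables, volume_tables):
    """Mix layers of (waveform, step, phase, volume) into one sample buffer."""
    out = [0] * n_samples
    for wave, step, phase, volume in layers:
        table = waveform_tables[wave]    # 512 bytes: the 256-byte wave repeated twice,
                                         # so in-group index overflow needs no wrap
        voltab = volume_tables[volume]   # pre-scaled volume lookup
        idx = phase
        for base in range(0, n_samples, 8):
            # the 8 offsets within the group are precalculated step multiples
            for k in range(8):
                out[base + k] += voltab[table[(idx + k * step) & 0x1FF]]
            # phase accumulator updated only once per 8 bytes; the last step
            # value is written back so the oscillator loops seamlessly
            idx = (idx + 8 * step) & 0xFF
    return out
```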
Now info about the spectrogram
Yes it is realtime, but nothing too complex about that as we already have access to the frequency and amplitudes per layer.
There are 2 main stages, the fader (which fades each level of the frequencies) and the plotter (which plots the main updated strips)
The sample rate is 5200Hz (rather low), but as I am calculating pitch tables on the fly for 8 layers (and the main decode has phase accumulators updating in realtime), this uses quite a lot of CPU time (though it saves a lot on precalculated data).
I do have variants of the decoder that use larger tables to precalculate both volume and waveform together and run at higher sample rates, but they use over 30k of tables (compared to 5k). Precalculating pitches to the waveforms is not an option either, as I would need over 1000 of these per waveform.
Average compression is between 300-700 bytes per second; however, since releasing this demo I have implemented "phase quantisation", which can shave roughly 100-200 bytes off the total without much loss in quality.
The encoder and decoder are being updated continuously with improvements (in particular, I am working on improving the quality at sub-300 bytes a second).
Some people were curious about how it would sound with some voices, so I have prepared some examples. Bear in mind that encoding takes a very, very long time, hence I used the quantised-phase option to encode faster as a preview (this gives roughly a 25% degradation in quality), but with the plus point of using on average under 450 bytes a second for both these examples. Examples of three voices (with instrumentation combined) here: https://youtu.be/LgIcf25rTiw - isok, added on 2024-07-25 14:39:05
- demo Commodore 64 Wonderland XIV by Censor Design [web]
Great demo, and I loved the end part. I would guess, based on the code in zero page, that the background repeating music section is updated once per 2 NMI updates, with the speech digi updated in another channel at double the sample update rate (the speech being streamed from disk where needed). It utilises the same ADSR digi method used in Wonderland 13 (the Prodigy part), hence an 8580 is recommended (and quality can vary - boot from disk 4 to check the quality level). Channel 2 is spare and may be used for the bass and other accompanying sections.
- rulez, added on 2023-06-12 23:32:31
- demo Commodore 64 TCBI50k by Algotech
Algorithm Presents TCBI50k (That Can't Be In 50k)
Some production and technical notes below.
This is a single filer that utilises multilayer decompression based on VQ for the background ambience and a speech layer for the singing.
A less CPU-intensive approach would have been simply to play back the ambience via d418 and use the SID's 3 oscillators for the speech. However, I did not opt for this due to cross-modulation issues with the classic d418 approach, as well as its lower bit resolution - and also because it would sound rather quiet on a new SID chip (unless I was able to spare a channel for pulse-waveform pre-boosting).
Hence the approach I used was to digitally mix everything to 8 bits (including 3 emulated sine oscillators).
By looking at the code you will notice that there could have been a major optimisation in the sine mixing (in fact this was used in previous Frodigis such as Frodigi 8). However, I did not have enough RAM for the additional tables, hence I resorted to using a master sinewave across two pages (consisting of 6-bit values) and then mapping that over to one of 32 64-byte lookups (for the volume of the sine). 8-step integer values are used to allow each sine to have frequency steps of 4.1Hz.
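For the curious, the fixed-point numbers above can be checked with a small sketch. This assumes the 8400Hz playback rate mentioned below and a 256-entry master sine advanced once per 8 samples; the table layouts are illustrative:

```python
import math

SAMPLE_RATE = 8400          # the demo's playback rate
TABLE_LEN = 256             # master sine (stored across two pages in the demo)
STEP_DIV = 8                # integer phase step applied once per 8 samples

# 6-bit master sine (0..63), as in the two-page table described above
SINE = [round(31.5 + 31.5 * math.sin(2 * math.pi * i / TABLE_LEN))
        for i in range(TABLE_LEN)]

# one 64-entry lookup per volume level, mapping the 6-bit sine to a scaled value
VOLUME = [[(v * level) // 31 for v in range(64)] for level in range(32)]

def freq_step(hz):
    """Integer step for a target frequency; resolution is ~4.1 Hz."""
    return round(hz * TABLE_LEN * STEP_DIV / SAMPLE_RATE)

# resolution check: one step unit is SAMPLE_RATE/(TABLE_LEN*STEP_DIV) Hz
assert abs(SAMPLE_RATE / (TABLE_LEN * STEP_DIV) - 4.1) < 0.01
```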
The first stage of preparation was to separate the speech from the audio track. Once this was done, the background audio track was resequenced, omitting any similar sections, and then packed via vector quantisation. This resulted in approximately 32k of packed ambience data with a 1k codebook.
The second stage of preparation was to pack the speech. Again, identical sections were omitted, and the rest packed via my speech encoder. This packs to 45 bits per frame: (10 bits frequency + 5 bits amplitude) * 3 channels. This data was around 18k or so in size.
This speech data was then sequenced to be triggered at precise times while the ambience plays back, mixing all the sines together with the ambience and pushing the result onto the stack, with the actual playback routine constantly playing backwards from the stack. The result is over 3 minutes of audio matching the original, all under 50k packed.
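The 45-bit speech frames could be packed and unpacked along these lines (a sketch; the actual bit order used by the encoder is not specified above, so this layout is an assumption):

```python
def pack_frame(channels):
    """Pack one speech frame: 3 channels of (10-bit frequency, 5-bit amplitude) = 45 bits."""
    bits = 0
    for freq, amp in channels:
        assert 0 <= freq < 1024 and 0 <= amp < 32
        bits = (bits << 15) | (freq << 5) | amp
    return bits  # 45 bits; a real stream would concatenate these across frames

def unpack_frame(bits):
    """Recover the 3 (frequency, amplitude) pairs from one 45-bit frame."""
    channels = []
    for _ in range(3):
        channels.append(((bits >> 5) & 0x3FF, bits & 0x1F))
        bits >>= 15
    return channels[::-1]
```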
Sample playback rate is 8400Hz, and there was barely any space or CPU time for anything else (in fact the code is scattered all across the RAM, as I needed to put the volume tables - 32 of them - at the beginning of each page).
For Digimax users: set it to $de00, then when the logo appears, hold down the space bar until the demo starts. For non-Digimax users, an 8580 SID is recommended; above all, it will not work on digiboost-modded SIDs.
Enjoy - isok, added on 2020-12-27 16:49:42
- demo Commodore 64 Sabrina 2020 by Algotech
Please view this only on a real C64 or an emulator with a FreeSync or 50Hz display; it makes heavy use of frame updates with offsets and shifting to generate the video frames, and uneven updates will further destroy the quality. For the same reason, this applies to YouTube footage that is just recorded from an emulator with frames decimated by the conversion.
quick notes as follows
16kHz sample playback with a more optimised SSDPCM2 audio decompressor. The whole song is there, with not a single piece missed out.
50fps video update (not partial). Frame sizes range from 240 bytes to 1k a frame. As the source video content was 25fps, I am using frame trailing and offset changes at 50fps. (There are some sequences later on that run at individual frames per frame update.)
I have prepared a video using frame blending for a rough preview. On a CRT screen it will flicker more, but frame updates should be stable.
8580 SID recommended (although it may produce worse audio on some 8580s too). 6581 not recommended; it will vary widely in quality.
The whole demo is packed to a 1mb easyflash cart.
Video preview here (may have limited life): https://streamable.com/3mkrr - isok, added on 2020-03-31 15:35:44
- demo Commodore 64 Wilde (Easyflash) by Onslaught [web]
Ok, here are the production and technical notes for this demo. Let's start off with some questions and answers.
Easyflash?.. Cheating.
Yes, it is to an extent, but only in regards to storage and data-retrieval speed. The CPU does not miraculously get overclocked. What you are still seeing is full-screen video decompression and sample decompression in realtime on a 1MHz C64, fitting everything into less than 1MB of storage.
So what is new from the previous bananarama easyflash demo?
The bananarama demo internally was just a simplified animation puller from Easyflash without any additional decompression (apart from the tile lookups). Frames were directly decoded from the Easyflash cart in one of two compression methods (CSAM and TileVQ 2x2). That is pretty much it. The main loop was not used, and the audio and video decompression was done inside the IRQ in chunks with no additional threads.
This new demo has quite a few advancements. Most notable is the higher-fidelity audio, which is encoded using my most up-to-date audio encoder (SSDPCM2 V3). The sample rate is also increased by over 30%. This obviously means more CPU usage for updating the samples as well as decoding them, leaving less CPU time free (along with more complexity in the audio decode in comparison).
The internal framework is more advanced and has tasks running from the IRQ, with audio decode constantly occurring in the IRQ using fixed CPU time, as in the bananarama demo. Data is pushed to the stack, with the NMI reading from the stack backwards and relocating the stack pointer per frame.
Video decode and other CPU-intensive tasks that would take more than a frame to process are pushed from the IRQ to interrupt the main loop and are processed there. These tasks can individually access Easyflash banks separately from the other tasks.
The main loop area is usually for the loader. This loader can load and decompress separate subsections of the demo from Easyflash to RAM. Again, this is independent of the other Easyflash banks that can be accessed from other tasks, and can occur at the same time without any conflicts.
In a nutshell, this means data loading and decompression to RAM from Easyflash while audio and video are being decompressed from Easyflash in the background. This allows maximisation of storage.
The bananarama demo ran at a constant 25fps; this demo, however, drops far lower in some areas.
Increased complexity in the audio decode and a higher sample rate use much more CPU time in comparison. The main killer is the NMI updates - each update costs over 30 cycles on average, including the overhead of starting the NMI. Multiply this by 216 updates per frame and that equates to over 100 rasterlines' worth of CPU time (just for playback from a rolling 256-byte buffer, not including the decompression). It was a decision I made to ensure that audio had more priority. In some sections the framerate is deliberately reduced in order to give more CPU time for the loader to depack the next subsection in time.
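The NMI cost estimate can be reproduced with some quick arithmetic (assuming PAL timing of 63 cycles per rasterline and 312 lines per frame):

```python
# PAL C64 timing (assumed): 63 cycles per rasterline, 312 rasterlines per frame
CYCLES_PER_LINE = 63
LINES_PER_FRAME = 312

nmi_updates_per_frame = 216      # sample updates per frame, as stated above
cycles_per_nmi = 30              # average cost including NMI entry overhead

cycles = nmi_updates_per_frame * cycles_per_nmi          # 6480 cycles
rasterlines = cycles / CYCLES_PER_LINE                   # ~103 rasterlines
frame_fraction = cycles / (CYCLES_PER_LINE * LINES_PER_FRAME)

print(f"{cycles} cycles = ~{rasterlines:.0f} rasterlines "
      f"({frame_fraction:.0%} of the frame, before any decompression)")
```

As a side note, 216 updates per frame at 50fps is 10800 updates per second, which lines up roughly with the ~11kHz sample rate quoted further down.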
In order to increase the apparent frame rate of the video decode, I made the decision to write to the same buffer and to decode horizontally. This gives partial frame updates of the next frame over the previous one (for scenes without drastic differences, it appears smoother due to the new frame gradually replacing the old frame).
Eww, frame bobbing up and down and colors randomly changing. Broken?
There is a disclaimer at the beginning of the demo on what may cause this. If it is captured by some amateur and put on YouTube without heeding this - or viewed on a desktop PC with 60Hz monitor output, or even worse, converted to 25/30Hz - it will destroy the whole video effect in the demo. This will result in the image randomly appearing to move up and down and to the left and right. Colors will appear to randomly change and look broken. Some of the "truecolor" merging in two of the parts will be totally broken, as will the chromatic-aberration effects in the video decode.
What is ideally recommended is a real C64 output at 50.125Hz onto a CRT. With the solid frame update, there will be none of the image bobbing and the random color changes. However, there will still be flicker; bear this in mind (less so if the CRT has some phosphor delay).
Now onto the more detailed description.
SAMPLE PLAYBACK AND DECODE
An optimised quad-delta-per-frame waveform recreator (SSDPCM2 V3) is used. I won't go into further detail on this method (it was described in my previous C64 demo that did it at 16kHz). The original audio was split into 4-bar segments and recreated to give the full audio from start to end, with all vocals and variations, in a smaller space. Each of these 4-bar patterns was additionally packed using the SSDPCM2 V3 method into a quarter of the original size.
The sample rate is 11kHz, which is just high enough to be able to hear the cymbals with less aliasing. After resequencing and packing, the whole audio takes up less than 450k of storage. This is roughly equivalent to a 16kbps bitrate (2 kilobytes per second).
The audio decoder uses a stack-pushing method to place decoded bytes onto the stack, ahead of the NMI that reads the stack backwards. To ensure there are no issues, the stack data for the code and processes is relocated per frame update, so that there are no conflicts and the NMI constantly reads only actual sample data, with no corruption.
The actual output method is via d418 with a pre-filter setup, allowing 6 bits or so of output resolution. This method is not fully stable, however, in particular on C64s with an old SID chip (6581). For this reason it is recommended to use a new SID (8580), which is more stable in regards to its filters - although again, there can be some variance.
Digiboost mods on the SID chip will result in the audio quality being destroyed, however; keep that in mind. There is no current support for this. Dual-SID setups can also be an issue, due to the autodetector detecting another chip, or external capacitors on the board affecting the quality.
THE LOADER AND FRAMEWORK
I have implemented a framework for this demo that allows multiple tasks to utilise different easyflash banks and not to interfere with each other. Status of $01 register can also be different within each task.
The loader can depack data using either the Doynax or LZWVL depackers. LZWVL was used for most of the depacking of data for subsections to RAM due to its very fast speed, at the expense of worse compression.
As mentioned before, the framework can push tasks to interrupt the main loop and then continue from the mainloop when finished. This is useful for processing data that may take more than one frame
VIDEO DECODE AND PLAYBACK.
In order to get any form of "decent" video playback with all the sample decode and playback in realtime, I had to resort to using the good old CSAM compressed images. These were packed further by tiling them to either 1x2 or 2x2 chars to take half or a quarter of the original CSAM size.
If you do not know what CSAM is: it's my video/frame encoder, which uses my own implementation of clustering via genetic encoding to allow only 256 chars to represent a whole sequence of specific video frames, with some other options such as masking/weighting, DCT matching etc.
It works by creating a random population - each member being a character - selecting the fitness of each member, and giving the members a lifespan based on their fitness. More population is brought into the group, members mate with a close match, and then, based on the new fitness, each is given a lifespan plus age. This is constantly refined.
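A toy version of that genetic loop might look like this (the fitness function, mating scheme and parameters are all simplified stand-ins for the real CSAM encoder):

```python
import random

def fitness(char, cells):
    """Illustrative fitness: negated bit-distance from `char` (8 bytes) to its
    nearest cell in the frame data. Real CSAM adds masking/weighting, DCT matching."""
    return -min(sum(bin(a ^ b).count("1") for a, b in zip(char, cell))
                for cell in cells)

def mate(a, b):
    """Crossover: each of the 8 rows comes from one parent, with a rare mutation."""
    child = [random.choice(pair) for pair in zip(a, b)]
    if random.random() < 0.1:
        child[random.randrange(8)] ^= 1 << random.randrange(8)
    return child

def evolve(cells, pop_size=32, generations=40):
    """Evolve a population of 8-byte chars toward ones that represent the cells well."""
    pop = [[random.randrange(256) for _ in range(8)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, cells), reverse=True)
        survivors = pop[: pop_size // 2]              # lifespan tied to fitness
        pop = survivors + [mate(random.choice(survivors), random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=lambda c: fitness(c, cells))
```

A real run would evolve a full 256-char population against every cell of a frame sequence; this sketch only shows the select/mate/refine cycle.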
In the demo, all of the "background" video sequences are encoded using CSAM together with either tile 2x2 or tile 1x2 reduction. Frames are decoded directly from easyflash banks.
For "post processing", due to the limited amount of cpu time available (after audio decode and frame decode) I am relying entirely on the output of the C64 being as it should be (50.125hz).
For this reason, there is a disclaimer at the beginning of the demo; the capture and display caveats described in the answers above apply in full here.
For YouTube or similar video capturing, it is recommended that you use a capture card and know what you are doing. If using an emulator, at least apply some post-processing to merge the frames together. I have seen some awful captures, even using a low-powered PC and direct desktop capture, resulting in audio stuttering and frame freezes.
"Deblocking" is used by merely shifting d011 and d016 per second frame. by combining different palettes per frame, this has a sideeffect of the shifting giving an impression of chromatic changes in edges.
Some of the parts utilise triple frame merging and with a 50hz stable display, this merges the frames to give an impression of more colors.
Additional notes
Most of the demo was done over a few evenings just before the start of the Switchon party. However, the audio and framework (as well as the encoding of the sequences) were done quite some time ago and only recently continued.
In order to simplify production of the separate subsections of the demo, I used a dispatch system where individual parts could be smoothly displayed and then handed back to the background animation player at any point in time, with loading and attachment to the existing IRQ.
Thanks to agod and jammer for the additional data (graphics and audio for the end part) - isok, added on 2018-06-10 23:11:24
- demo Commodore 64 VF-SSDPCM1 Super Plus by Algotech
- Technical details as follows
Another one of them streaming sample demos
------------------------------------------
I would like to call it more of a "proof of concept" and a demonstration of the output produced by the encoder. In a few words, this demonstrates the successor to SSDPCM1-Super, with increased quality at nearly half the size. The aim is size reduction and increased quality in comparison to the SSDPCM1-Super method (and it is not to be compared with the audibly far higher quality SSDPCM2 V3, which produces larger packed files).
In its current stage, there are still some other enhancements that can be made to it that will more than likely be used as part of a demo (Although not taking all this amount of space as this demo does)
History of SSDPCM1 and variants
-------------------------------
SSDPCM2 will not be included in this section, only SSDPCM1.
SSDPCM1 is a 1-bit sample-shaping method which rebuilds the sample by incrementing or decrementing the current sample via step sizes that are changed and adjusted based on the changing characteristics of the sample.
The very first demonstration of the SSDPCM1 method was in my demo "Channels". This method was the most basic and would merely select one step size for a given chunk and attach it to a stream of 1-bit packed data. This single step size would serve as both a positive and a negative value. The step-size values would change per chunk to accommodate changes in the waveform more precisely (rather than just one constant step value for the entire sample stream).
The bitstream would then be read: a set bit would indicate adding the step size to the current sample, and a clear bit subtracting it.
It worked pretty well but would result in quite a lot of artifacts when dealing with more complex waveforms
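A decoder for this first SSDPCM1 variant is almost trivial; a Python sketch (the chunk layout and 8-bit clamping are my assumptions):

```python
def ssdpcm1_decode(bitstream, steps, chunk_len, start=128):
    """Decode SSDPCM1: one step size per chunk; a set bit adds the step to the
    current sample, a clear bit subtracts it (clamped to the 8-bit range)."""
    out, sample = [], start
    for chunk, step in enumerate(steps):
        for i in range(chunk_len):
            bit_index = chunk * chunk_len + i
            byte = bitstream[bit_index // 8]
            bit = (byte >> (7 - bit_index % 8)) & 1     # MSB-first (assumed)
            sample = max(0, min(255, sample + step if bit else sample - step))
            out.append(sample)
    return out
```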
The second demonstration of SSDPCM1 was the end part of my Algodreams demo. The decoder worked in exactly the same way; however, the encoder now had 8 bytes of lookahead, allowing worsening changes if required that would result in fewer errors later. Quality was noticeably improved, and no change was required in the decoder.
For both the above methods, step values were updated per 64 or 128 samples. It really was not feasible to have one step size per frame or so (e.g. for 8400Hz, one step size per 168 samples); quality would suffer.
Then there was a drastic change to the SSDPCM1 method, known as SSDPCM1-Super (demonstrated in my "Axel F" demo).
This method would increase the file size (though not by much) and quality would be a lot higher.
The encoder would brute-force two step values per chunk and choose which of the two to "go through the path" with per 8 bytes.
The advantage of this approach was that it requires only 1 bit of additional data per 8 bytes, which gives the decoder the correct step value to use to decode those 8 bytes.
Overall, the added file size would then be (((samplerate/50)/8)/8)+2 bytes per frame added to the bitstream (samplerate/50)/8
Comparing it to SSDPCM1 at 8400Hz, for example, the file sizes per second would be:
SSDPCM1 1100 bytes
SSDPCM1-Super 1350 bytes
As the fidelity was vastly improved, I could get away with changing step sizes only once per frame (this also makes the decoder faster). In the case of the Axel F demo, I was updating at 10800Hz, which equates to only 1650 bytes per second packed.
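The overhead formula above can be sanity-checked numerically (this is my reading of it; the byte rounding is an assumption, and real files carry a little extra per-chunk overhead):

```python
import math

def ssdpcm1_super_bytes_per_second(samplerate, fps=50):
    """Per-frame cost per the formula above: the 1-bit sample stream, plus one
    selector bit per 8 decoded bytes, plus 2 step bytes, at one step pair per frame."""
    samples_per_frame = samplerate // fps
    bitstream = samples_per_frame // 8          # 1 bit per sample
    selectors = math.ceil(bitstream / 8)        # 1 extra bit per 8 decoded bytes
    return (bitstream + selectors + 2) * fps
```

This gives 1650 bytes/s at 10800Hz, matching the Axel F figure, and ~1300 bytes/s at 8400Hz, in the same ballpark as the 1350 quoted above.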
And now it's time for VF-SSDPCM1-Super...
So what is this VF-SSDPCM1-Super Plus?
--------------------------------------
File sizes up to half as small as SSDPCM1-Super but at even higher quality. How does it work?
The encoder analyses the spectral content of each chunk and its neighbouring chunks and, using psychoacoustic masking, determines whether or not to halve or quarter the size of the chunk. This is also based on the amount of low/medium/high frequency strength within a chunk.
If there are only low frequencies present - let's assume <2000Hz with an original sample rate of 8000Hz - we can get away with halving the sample rate to 4000Hz without affecting quality too much (and with less of an aliasing issue, due to non-existent or weaker signals >2000Hz).
Hence per chunk, the samples get reduced in size and expanded by the decoder to reproduce the sample.
Now, that is pretty straightforward. The key, however, is to reduce the pumping or fluttering that would occur when decoding these mixed chunks. This could be achieved by more post-processing, but the aim is a low-complexity decode using similar CPU time to the older method (some experiments were done with interpolation and tweening, which worked but used a considerably high amount of CPU time).
As mentioned previously, the encoder analyses the spectral content of the chunk, taking into consideration its neighbouring chunks. If the current chunk has more high-frequency content than low, but the next chunk has more low-frequency content than high, what should the decision be for the current chunk? We compare threshold values between the frequencies, looking ahead in time, and determine the course of action: whether to resample that high-frequency chunk to low (which would give aliasing) or to retain the high frequency, which would cause some transition issues when changing back to low afterwards.
All this data is packed using a much improved version of SSDPCM1-Super, which now operates on 4 unique step values per chunk rather than just 2 (which were the negative and positive of each other).
This enhancement only needs two additional bytes per chunk, but at the expense of extremely long encoding times due to brute forcing.
To lower the encoding time, the maximum step-value boundaries to brute-force are reduced or increased per chunk based on its frequency content. So let's now go onto the file sizes.
Let's assume a sample rate of 10400Hz.
Full frequency packed data per chunk (frame in this case) would be packed to 34 bytes a frame expanding to 208 bytes when depacked
Half frequency packed data per chunk (frame in this case) would be packed to 19 bytes a frame expanding to 104 bytes when depacked but either applying interpolation or byte doubling to fit 208 bytes
Quarter frequency packed data per chunk (frame in this case) would be packed to 11 bytes a frame expanding to 48 bytes when depacked and then stretching to 208 bytes (with or without interpolation)
The encoder uses tolerance values to determine the minimum averaged amplitude of the frequency bands where it can trigger very low/mid/high encoding of the frequencies, hence compression can be tweaked.
Overall, let's sum up some compression ratios for SSDPCM1, SSDPCM1-Super and VF-SSDPCM1-Super Plus (for a sample rate of 10400Hz):
SSDPCM1 1350 bytes per second 7.7:1
SSDPCM1-Super 1600 bytes per second 6.5:1
Now, with the VF-SSDPCM1-Super Plus method, the bytes per second vary based on the very-low/mid/high frequency content and the tolerances in the encoder. It can be as low as 550 bytes per second or as high as 1700 bytes per second. For most audio content, an in-between value of 1000-1100 bytes per second would be the average (around 10:1). Not bad for something higher in fidelity than SSDPCM1-Super.
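As a quick check of those ratios against raw 8-bit samples at 10400Hz:

```python
raw_bytes_per_second = 10400   # unpacked 8-bit samples at 10400 Hz

for name, packed in [("SSDPCM1", 1350), ("SSDPCM1-Super", 1600),
                     ("VF-SSDPCM1-Super Plus (typical)", 1050)]:
    print(f"{name}: {raw_bytes_per_second / packed:.1f}:1")
```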
Let's get onto the details of the demo
-------------------------------------
Proof of concept: it's just some text with a Madonna picture and a realtime visualiser (crude blocks) on top. It's there to demonstrate the VF-SSDPCM1-Super Plus method, nothing more or less.
The Madonna track used in the demo is the full audio from start to end, with barely anything missed out (only some subtle sections, if you notice).
As it contains some repeating chorus and verse sections, I was able to reduce the amount of samples. However, this still equated to over two minutes of unique samples.
Each 4-bar pattern was then packed, giving an approximate compression of 9:1 or 10:1.
The previous version utilised a very small decoder which would push decoded bytes to the stack. However, I decided to opt for a page-switching, non-stack routine which was unrolled (and occupied approximately 8k of RAM). This also used less CPU time than the stack-push method (albeit much more RAM).
The visualiser is realtime and is based on the actual output of the decoded samples at that moment.
I have used an updated scheduler for the stream chunk loading, which simplifies the creation of the segment order for the sample loading.
Previously, I would have trigger points to load the relevant packed samples into the required slots and then play these back in a specific order, which did work well but was quite a pain to put together manually.
Now I simply forward the relevant 4-bar segment pattern data to the scheduler, which then performs the loading (or copying from other slots if the data already exists elsewhere) to prevent reloading of a required sample segment. This is all done while the decoder is playing back previously loaded segments.
Can it be optimised and improved further?
----------------------------------------
Indeed it can. The actual decoder I am using is a branchless method (which is more suited to interleaved decode). It was quick to put together and was able to stream in time, hence I used it. This is a testbed for the more optimised decoder and improvements to come when it's utilised as part of a real demo.
One thing to note, however, is that there is always a tradeoff between CPU usage and the possibility of streaming from floppy. To allow free streaming with no buffer overrun, the CPU usage of the decode needs to be low AND the file sizes of the packed data need to be below a specific amount. And if a full song with vocals is required, the final file size is of particular importance, as there is not much storage space available.
There are ways of getting around high CPU usage and low disk-load speed, but that involves prebuffering many segments and repeating loops (look at my SSDPCM 16kHz demo as an example :-)).
I was considering subsets of 2- or 1-bit streams with step values, but my aim was a file size nearer the 1k mark, and to achieve this via the waveshaping method I opted for frequency adjustments instead of control-bit adjustments.
Sample quality issues
---------------------
Yes, due to the high compression, sample quality will suffer (but the goal was to improve on SSDPCM1-Super with much higher compression, and that was achieved).
However, as a reminder: this demo is only for the 8580 (new SID). Dual SID is not supported (there are many issues with this due to detection, and even if correctly detected, the filter caps may have more variance). Nonetheless, there is autodetection for the old SID too, but in most cases it will sound much worse. If you have a digiboost mod on your SID, then there is not much point running the demo unless you want some severe distortion.
Digimax is supported: if you have one of these devices at $DE00, hold down space when the notice comes up at the bottom of the screen until the screen turns black.
If using an emulator, use ReSID and 8580 (if using VICE). Micro64 is fine using 8580. Hoxs64 does not seem to work (due to incomplete drive emulation); even if it does work in future, make sure to turn off digiboost.
Other Info
----------
Thanks to Krill - full credit for the loader code goes to him. If you experience any issues (unless it is on a 1541U1 or SD2IEC), please get in touch. If copying it to run on a real floppy, please make sure that you copy the full 40 tracks and not the default. - isok, added on 2018-04-06 21:02:53
- demo Commodore 64 SSDPCM2 V3 - 16Khz by Algotech
- bugfixed version uploaded. Should hopefully appear as a link soon. Otherwise can download from http://csdb.dk/release/download.php?id=201252
Fixes the following issues:
audio glitch for a few frames while transitioning between the text routine and the scroller routine
crashes or sample corruption after the first run (or possibly on the first run) due to overrunning the d800 clear, which sends wrong values back to the main loop
new beta version of Krill's loader - isok, added on 2018-03-10 19:12:16
- demo Commodore 64 SSDPCM2 V3 - 16Khz by Algotech
- Technical details as follows as well as questions and answers
Another sample playback with streaming?
---------------------------------------
Add 16kHz playback and realtime SSDPCM2 V3 depacking on the fly to the equation, together with CPU time left over for streaming, other gfx effects and vertical scrolling of images, on a <1MHz CPU.
What is SSDPCM2
--------------
It is a lossy audio-compression algorithm, developed by myself, that is specifically designed for low CPU usage when decoding. It's similar to the DPCM/ADPCM line of methods, where the sample is shaped when depacking to approximate the original waveform as closely as possible.
To save on CPU usage when decoding, optimum step sizes are brute-forced per chunk, which minimises additional calculations during decoding (such as analysing repeating bit patterns to adjust step sizes on the fly).
There are 3 versions of SSDPCM2 that I have developed (this demo uses the SSDPCM2 V3 encoder/decoder).
SSDPCM2 V1
This was used in the Just Dance 64 demo (3.9khz), the Sampleblaster demo (7.8khz), as well as the 15khz (screen off) digi part in Algodreams. (It was also used in one of my previous Amiga demos.)
The encoder works by brute-forcing two step sizes (each of which is then expanded into a negative and a positive step size).
The source sample sequences are compared with the results of adding or subtracting the steps from the shaped sample; the nearest match is used and the error recorded. A lookahead of 2 bytes was added in a later revision to allow the first move to be a worsening one (this increased quality overall, but was not enough to warrant a new version number).
The step sizes with the least error are used for a given chunk. Step sizes can be adjusted from 32 bytes per chunk onwards, regardless of sample rate.
The decoder uses a direct byte-to-codetable approach, consisting of 256 dedicated, optimised code chunks to decode the sample.
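To make the V1-style search concrete, here is a rough Python sketch of the idea as I read it (function and variable names are mine, not from the actual encoder): two candidate step magnitudes are brute-forced, each expanded into a +/- pair, and per source sample the nearest reachable value is taken greedily while the error is accumulated.

```python
def encode_chunk(samples, step_candidates, start):
    """Sketch of an SSDPCM2-V1-style search: try every pair of step
    magnitudes; each magnitude yields a +step and a -step, giving
    four possible moves per sample.  Greedily pick the move that
    lands nearest the source sample and accumulate the error."""
    best = None
    for a in step_candidates:
        for b in step_candidates:
            moves = (a, -a, b, -b)
            acc, err, codes = start, 0, []
            for s in samples:
                # nearest of the four reachable values wins (2-bit code)
                code = min(range(4), key=lambda c: abs(acc + moves[c] - s))
                acc += moves[code]
                err += abs(acc - s)
                codes.append(code)
            if best is None or err < best[0]:
                best = (err, (a, b), codes)
    return best  # (total error, chosen step pair, 2-bit codes)
```

This omits details the real encoder would need (clamping to the 8-bit sample range, chunked step adjustment every 32+ bytes), but it shows why the decoder can stay cheap: all the searching happens at encode time.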
SSDPCM2 V2
An advancement over version 1, also resulting in less CPU usage for decoding. This was used in the Easybananaflashrama demo.
This uses the same brute-force approach as V1, but compares the step values against the source with a 4-8 byte lookahead, allowing worsening step values to be used, which in turn results in higher quality overall. A windowing scheme is also used (as there is just no computer available that could brute-force all combinations for the whole sample chunk).
Again two step values are used, with each step value expanded into a negative and a positive version. The encoder output is backwards compatible with the older V1 decoder.
However, I have since updated the decoder to push decoded bytes to the stack ahead of time (before the NMI that reads the stack). This results in lower CPU usage for decoding in comparison.
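The effect of lookahead can be illustrated like this (my own sketch, not the real encoder): scoring short move sequences instead of single moves lets a locally worse first step win when it sets up a better position for the following samples.

```python
from itertools import product

def lookahead_pick(acc, targets, moves, depth=2):
    """Pick the next step by exhaustively scoring every move
    sequence of length `depth` against the next targets, so the
    first move is allowed to be locally worse (the V2 idea)."""
    window = targets[:depth]

    def cost(seq):
        a, e = acc, 0
        for m, t in zip(seq, window):
            a += m
            e += abs(a - t)
        return e

    return min(product(moves, repeat=len(window)), key=cost)[0]
```

For example, with moves (+4, -4, +1, -1), start 0 and targets [2, 8]: a greedy pick (depth 1) takes +1 as the closest to 2 and ends up with a total error of 4, while the 2-byte lookahead takes the locally worse +4 and finishes with a total error of 2.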
SSDPCM2 V3
This has some dramatic changes under the hood, at the expense of an extremely high computational load when encoding, which will be explained further below.
The encoder can now generate 4 unique step values. This was very tempting some time back, but the decoder of that time would not have worked without changes that used more CPU time for decoding, so the idea was put on hold.
For 4 unique step values (any negative or positive number) to be usable at low CPU cost, the decoder has changed completely. In the last two versions, each byte was translated to dedicated code chunks which would subtract or add specific step values based on the bit pattern in the byte.
With the 4-unique-step approach, this would not be possible unless...
...you add data in order to subtract. Simply modifying each code fragment to constantly add specific step values to the current sample can be done, but it would cost additional cycles, since the carry flag has to be cleared between each add (as it is unknown whether a given step value generates a carry into the current byte). There is also the overhead of setting up the codetable to jump to.
The decoder that I created instead is optimised more for interleaved decoding, and is only slightly slower than the codetable approach: by a few cycles per 4 decoded bytes.
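A behavioural model of such a decoder, in Python purely for clarity (the real thing is 6502 code and works quite differently at the cycle level): each packed byte carries four 2-bit codes, each selecting one of four arbitrary signed steps, which is added to the running sample with 8-bit wraparound.

```python
def decode_v3(packed, steps, start=0x80):
    """Behavioural model of a 4-unique-step decoder: each packed
    byte holds four 2-bit codes; each code selects one of four
    arbitrary signed step values, added to the running sample
    (wrapping like an 8-bit accumulator would)."""
    acc, out = start, []
    for byte in packed:
        for shift in (6, 4, 2, 0):          # high code first
            acc = (acc + steps[(byte >> shift) & 3]) & 0xFF
            out.append(acc)
    return out
```

The point is that the four steps are now arbitrary values chosen per chunk by the encoder, rather than two magnitudes mirrored into +/- pairs.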
As with V1 of the encoder, the encoder can now also generate unique step values more often for a further quality increase. (Thankfully this does not increase CPU time, but it does increase file size a little.)
The encoder analyses the chunk and determines its complexity and frequency content. Higher frequencies indicate the use of higher step values, lower frequencies lower step values.
Based on this, the maximum step size is generated. In reality, the actual maximum optimum step size selected ends up approximately half of that (when using the approach of comparing deltas between bytes).
The encoder has to brute-force all unique combinations of the 4 step values. Note the word unique: 64x64x64x64 would be roughly 16.8 million iterations, but this can be reduced to far fewer by only considering unique values, ignoring order and repeats (previous versions of the SSDPCM encoder also used unique values only). Even then, you can get a rough idea of the amount of work involved with, say, -$1f to +$1f per step size.
In addition to this, it brute-forces multiple bytes, allowing worsening moves, and windows across the whole chunk between each step iteration.
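For a feel of the search-space reduction from requiring unique, order-independent step values (my arithmetic, assuming a 64-value range per step):

```python
from math import comb

# Ordered brute force over quadruples of step values:
ordered = 64 ** 4      # 16,777,216 candidate quadruples

# Restricting to unique values with order ignored shrinks
# the space by a factor of ~26:
unique = comb(64, 4)   # 635,376 combinations
```

Each of those combinations still has to be scored against the chunk with lookahead and windowing, which is where the "extremely high computational load" at encode time comes from.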
Why another sample demo
-----------------------
Well... if I were to release V3 of SSDPCM at 8.4khz, then it would be pretty much similar to demos I did previously at similar sample rates (e.g. Easybananaflashrama, Sampleblaster etc). I decided to opt for something more drastic...
As I mentioned previously, the decoder has been coded to be optimised for interleaved decoding. The main killer in digi-demos when it comes to CPU usage is the actual updating of samples via the non-maskable interrupt. For low sample rates this is not too much of an issue, but for higher ones, let's assume 16khz: having to play back a buffer via the NMI and write to the SID registers can easily occupy all rastertime for the playback alone.
The drastic approach and end result I wanted was 16khz sample playback with the screen on and d011-insensitive (free scrolling is not an issue), as well as realtime SSDPCM sample decoding on the fly, while still leaving free CPU time for other things.
This was achieved by interleaving the full decode outside of the screen area via a stable IRQ, with per-line writes to the SID for the 16khz. More bytes are decoded than SID register writes occur within the interleaved code. After this, the NMI continues with the rest, but it needs to be precisely timed to start so that badlines will not interfere with it. It is also timed to ensure that the NMI does not occur while the IRQ and its stabilisation take place before the interleaved writes. Additional SID writes are done at precise positions in between takeover.
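As a sanity check on the "one SID write per raster line" playback rate (my calculation, assuming PAL timing; these are the well-known PAL C64 constants):

```python
CPU_HZ = 985248           # PAL C64 CPU clock in Hz
CYCLES_PER_LINE = 63      # CPU cycles per PAL raster line

# One SID write per raster line gives a playback rate of
# ~15.64 kHz, i.e. just under the quoted "16khz".
line_rate = CPU_HZ / CYCLES_PER_LINE
```

So per-line writes are about as fast as sample output can get on PAL without multiple writes per line.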
Remember: this demo not only plays back 16khz 8-bit samples, it also depacks them in realtime with less quality loss, at a pack rate of approximately 3.6:1, all while scrolling full-screen images and streaming from disk.
Disk streaming
--------------
This demo uses Krill's unreleased loader, which has many neat speed improvements (over 7k a second maximum load speed) and the ability to load/depack a file anywhere without being restricted to the two-byte load header.
Due to the 16khz playback and decoding, there are approximately 100 lines "spare". I put this in quote marks as each line is occupied by the NMI, which eats up approximately 30 cycles per line. This leaves a conservative 55-60 rasterlines free. With additional code such as the scrolling images, spare rastertime is reduced further.
This results in the loader giving an approximate load speed of 700 bytes per second.
There are 34 two-bar patterns, each packed to 3.5k, occupying 120k or so of disk space.
I decided to opt for a modified slot-load approach, loading multiple samples in one hit rather than at predefined timing positions.
Per 32 six-tick patterns, the loader will load the specified 5 samples to the specified slots while the current sample patterns/slots are played back. This has been found to be a good estimate that is guaranteed (I hope) to load all 5 samples before the next preload: 18k of data loaded per 26 seconds.
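The numbers above roughly check out (my arithmetic, assuming "k" means 1024 bytes):

```python
load_rate = 700                   # bytes/sec under full demo load
window = 26                       # seconds per preload window
budget = load_rate * window       # 18,200 bytes, the quoted ~18k

payload = 5 * int(3.5 * 1024)     # five 3.5k packed samples = 17,920

# the five samples just fit inside one preload window
assert payload <= budget
```

The margin is thin (under 300 bytes), which fits the "guaranteed (I hope)" phrasing above.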
Even with this method, bear in mind that the CPU usage of the demo limits the load speed, which in turn limits the number of unique samples that can be played in any order (and to achieve this, some preloading occurs while repetitive samples are played back).
Required hardware/emulator setup
--------------------------------
For best audio quality, please use a new SID. The demo does detect the SID type and supports the old SID too, but quality will vary even more between different SIDs. Digiboost hardware mods are not supported; audio quality will be destroyed entirely with one active.
If running it on an emulator, VICE and Micro64 are fine. Again, ensure that digiboost is turned off and that reSID is used as the SID engine. I have found that hoxs64 produces quite rough-sounding digitized audio with the 6bit+ $d418 method (even with digiboost turned off).
Finally, if you wish to hear the samples as they were intended, there is a DigiMax mode which can be activated by holding down the space bar during boot (when the message appears at the bottom telling you to hold space). Hold space until the screen turns black (and ensure a DigiMax is available at $de00). - isokadded on the 2018-03-10 01:06:27
- demo Commodore 64 easybananaflashrama by Algotech
- I felt a bit guilty doing it for EasyFlash. However, bear in mind that the EasyFlash is "only" used for the storage and direct retrieval of the packed data. All depacking of audio and video is still done by the C64, albeit without too many load and storage restrictions. It was mainly produced as a test for the EasyFlash cart, and I made a "demo" out of it. Overall it is only an animation/audio player that loads and depacks the data while playing back.
The hard part was the sequencing of the animations with the audio :-) - isokadded on the 2018-03-08 19:52:47
- demo Commodore 64 easybananaflashrama by Algotech
- Digitized audio is possible on the Plus/4. I would guess it's register $ff11, but that only seems to have 3-bit output. Hence, unless there are other new techniques for digitized playback on a Plus/4 machine, it would unfortunately sound very noisy. As an example, the C64 version outputs 6 bits in this demo.
- isokadded on the 2018-02-19 20:12:40
account created on the 2011-01-25 21:59:22