SSDPCM2 V3 - 16Khz by Algotech
added on the 2018-03-10 00:48:49 by algorithm
Another sample playback with streaming?
---------------------------------------
Add 16kHz playback and real-time SSDPCM2 V3 depacking on the fly to the equation, together with CPU time left over for streaming and other gfx effects such as vertical scrolling of images, all on a <1MHz CPU.
What is SSDPCM2?
----------------
It is a lossy audio compression algorithm I developed, designed specifically for low CPU usage when decoding. It's similar to the DPCM/ADPCM family of methods, where the sample is shaped during depacking to approximate the original waveform as closely as possible.
To save CPU time when decoding, the optimum step sizes are brute-forced per chunk at encode time, which minimises the additional calculations needed during decoding (such as analysing repeating bit patterns to adjust step sizes on the fly).
I have developed 3 versions of SSDPCM2 (this demo uses the SSDPCM2 V3 encoder/decoder):
SSDPCM2 V1
This was used in the Just dance 64 demo (3.9kHz), the sampleblaster demo (7.8kHz) and the 15kHz (screen off) digi part in algodreams. It was also used in one of my previous Amiga demos.
The encoder brute-forces pairs of step sizes (each step size is then used as both a negative and a positive step).
The source samples are compared with the result of adding or subtracting each step to or from the shaped sample; the nearest match is used and its error recorded. A later revision added a 2-byte lookahead that allows the first move to be a worsening one (this increased quality overall), though the change was not big enough to warrant a new version number.
The step-size pair with the least error is used for a given chunk. Step sizes can be changed per chunk, with chunks as small as 32 bytes, regardless of sample rate.
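A minimal Python sketch of the V1 encoder idea described above, assuming unsigned 8-bit samples, 2-bit codes that select one of +a, -a, +b, -b, step sizes limited to 1..31, and a plain greedy nearest match (the later 2-byte lookahead is left out). The function names and the code-to-step mapping are hypothetical, not taken from the real encoder.

```python
from itertools import combinations


def encode_chunk(src, prev, a, b):
    """Greedily shape `prev` toward each source sample using steps +a, -a, +b, -b.

    Returns (codes, decoded, total_abs_error); code n selects deltas[n].
    """
    deltas = (a, -a, b, -b)  # hypothetical code-to-step mapping
    codes, decoded, total = [], [], 0
    for s in src:
        best = None
        for code, d in enumerate(deltas):
            v = prev + d
            if 0 <= v <= 255:  # reject moves that leave the unsigned 8-bit range
                e = abs(s - v)
                if best is None or e < best[0]:
                    best = (e, code, v)
        e, code, prev = best  # with steps <= 31 at least one move is always valid
        codes.append(code)
        decoded.append(prev)
        total += e
    return codes, decoded, total


def best_steps(src, prev, max_step=31):
    """Brute-force every unique pair of step sizes; return (error, a, b) of the best."""
    best = None
    for a, b in combinations(range(1, max_step + 1), 2):
        _, _, err = encode_chunk(src, prev, a, b)
        if best is None or err < best[0]:
            best = (err, a, b)
    return best


print(best_steps([104, 116, 112, 100], prev=100))  # (0, 4, 12)
```

Only the pair (4, 12) reproduces this short chunk exactly, so the brute force finds it with zero error; real audio chunks are longer and the search just keeps whichever pair has the smallest total error.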
The decoder uses a direct byte-to-codetable approach: 256 dedicated, optimised code chunks, one per possible byte value, decode the sample.
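The byte-to-codetable idea can be modelled in Python by precomputing, for every possible packed byte, the four deltas it stands for, so that one lookup yields four samples. This is only an analogy for the 256 dedicated 6502 code chunks; the MSB-first bit order and the code-to-step mapping are assumptions.

```python
def build_codetable(a, b):
    """For each of the 256 possible packed bytes, precompute the four deltas it encodes.

    Mirrors the idea of one dedicated code chunk per byte value; the 2-bit codes
    are assumed to be packed MSB-first.
    """
    deltas = (a, -a, b, -b)  # hypothetical code-to-step mapping
    return [tuple(deltas[(byte >> shift) & 3] for shift in (6, 4, 2, 0))
            for byte in range(256)]


def pack_codes(codes):
    """Pack 2-bit codes four per byte, MSB-first (length assumed to be a multiple of 4)."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        c0, c1, c2, c3 = codes[i:i + 4]
        out.append(c0 << 6 | c1 << 4 | c2 << 2 | c3)
    return bytes(out)


def decode_chunk(packed, prev, a, b):
    """Decode packed bytes by table lookup: one lookup yields four samples."""
    table = build_codetable(a, b)
    samples = []
    for byte in packed:
        for d in table[byte]:
            prev = (prev + d) & 0xFF  # 8-bit wraparound, as in an 8-bit register
            samples.append(prev)
    return samples


packed = pack_codes([0, 2, 1, 3])
print(packed.hex(), decode_chunk(packed, 100, 4, 12))  # 27 [104, 116, 112, 100]
```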
SSDPCM2 V2
An advancement on Version 1 that also uses less CPU time for decoding. This was used in the easybananaflashrama demo.
This uses the same brute-force approach as V1, but compares the step values against the source with a 4-8 byte lookahead, allowing worsening steps to be chosen, which in turn results in higher quality overall. A windowing scheme is also used (as there is simply no computer available that could brute-force all combinations over a whole sample chunk).
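Lookahead with a sliding window can be sketched in Python as follows: at each position every code sequence of length `depth` is scored against the next few source samples, and only the first code of the best sequence is committed. This is a generic lookahead search under the same assumptions as a plain greedy encoder (unsigned 8-bit samples, steps 1..31, hypothetical code mapping), not the actual V2 encoder.

```python
from itertools import product


def encode_lookahead(src, prev, a, b, depth=4):
    """Encode with a sliding lookahead window over steps +a, -a, +b, -b.

    Every code sequence of length `depth` is tried at each position; the first
    code of the lowest-error sequence is committed, so a locally worse move can
    win if it pays off later. Steps are assumed to be 1..31 so a valid path exists.
    """
    deltas = (a, -a, b, -b)  # hypothetical code-to-step mapping
    codes, out, total = [], [], 0
    for i in range(len(src)):
        window = src[i:i + depth]
        best = None
        for seq in product(range(4), repeat=len(window)):
            v, err = prev, 0
            for code, s in zip(seq, window):
                v += deltas[code]
                if not 0 <= v <= 255:
                    break  # this path leaves the 8-bit range
                err += abs(s - v)
            else:
                if best is None or err < best[0]:
                    best = (err, seq[0])
        prev += deltas[best[1]]
        codes.append(best[1])
        out.append(prev)
        total += abs(src[i] - prev)
    return codes, out, total


print(encode_lookahead([102, 120], 100, 1, 10, depth=1))  # ([0, 2], [101, 111], 10)
print(encode_lookahead([102, 120], 100, 1, 10, depth=2))  # ([2, 2], [110, 120], 8)
```

With `depth=1` the search is greedy and takes the small +1 step first; with `depth=2` it accepts a worse first move (+10) and ends with less total error. The cost grows as 4 to the power of `depth` per sample, which is why a full-chunk search is out of reach and a window is used.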
Again, two step values are used, each converted to a negative and a positive step. The encoder output is backwards compatible with the older V1 decoder.
However, I have now updated the decoder to push decoded bytes to the stack ahead of time (before the NMI that reads them from the stack), which lowers the CPU usage of decoding in comparison.
SSDPCM2 V3
This version has some dramatic changes under the hood, at the expense of an extremely high computational load when encoding, as explained below.
The encoder can now generate 4 unique step values. This was tempting some time back, but the decoder at the time would not work without changes that cost more CPU time for decoding, so the idea was put on hold.
To make 4 unique step values (any negative or positive number) possible at low CPU usage, the decoder has been changed completely. In the last two versions, each byte was dispatched to a dedicated code chunk that added or subtracted specific step values based on the bit pattern in the byte.
With 4 unique step sizes, this would not be possible unless...
Adding data to subtract. Modifying each code fragment to always add specific step values to the current sample can be done, but it costs additional cycles, because the carry flag has to be cleared before each add (whether a step value generates a carry into the current byte is unknown). There is also the overhead of setting up the codetable to jump to.
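The carry problem can be illustrated with a small Python model of the 6502 ADC instruction (binary mode only; flags other than carry are ignored). Adding $fc, the two's complement of 4, subtracts 4 but leaves the carry set, and a following ADC without a CLC picks up that stale carry.

```python
def adc(acc, operand, carry):
    """Model of the 6502 ADC instruction in binary mode: returns (result, carry_out)."""
    total = acc + operand + carry
    return total & 0xFF, total >> 8


# Adding the two's complement of 4 ($fc) subtracts 4 ...
sample, carry = adc(100, 0xFC, 0)
print(sample, carry)          # 96 1  (correct result, but carry is left set)

# ... and if the next add runs without CLC, the stale carry leaks in:
print(adc(sample, 4, carry))  # (101, 0) instead of (100, 0)
print(adc(sample, 4, 0))      # (100, 0) with the carry cleared first
```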
The decoder I created instead is optimised more for interleaved decoding and is only slightly slower than the codetable approach, by a few cycles per 4 decoded bytes.
As with V1, the encoder can also generate unique step values more often for higher quality (this thankfully does not increase CPU time, but it does increase the file size a little).
The encoder analyses the chunk and determines its complexity and frequency content. Higher frequencies call for larger step values, and lower frequencies for smaller ones.
Based on this, the maximum step size for the search is derived. In practice, the optimum maximum step size actually selected is roughly half of what comparing deltas between bytes would suggest.
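As a rough illustration only, a hypothetical Python helper can turn byte-to-byte deltas into a search bound and halve it, following the observation that the optimum maximum step ends up around half of what the raw deltas suggest. The real encoder's complexity and frequency analysis is not described in detail, so this delta measure, the halving and the clamp to $1f are all assumptions.

```python
def max_step_bound(chunk, limit=0x1F):
    """Hypothetical search bound for step sizes in one chunk.

    Uses the largest byte-to-byte delta as a crude complexity/frequency measure,
    then halves it. The result is clamped to 1..limit.
    """
    deltas = [abs(b - a) for a, b in zip(chunk, chunk[1:])]
    raw = max(deltas, default=0)
    return max(1, min(limit, raw // 2))


print(max_step_bound([0, 10, 4, 30]))  # 13 (largest delta 26, halved)
print(max_step_bound([50, 50, 50]))    # 1  (silent chunk, minimum bound)
```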
The encoder has to brute-force all combinations of 4 unique step sizes. Note the word unique: 64x64x64x64 would be about 16.8 million iterations, but this can be reduced a lot by only using unique values (ignoring order and repeated values; previous versions of the SSDPCM encoder also used unique values only). Even so, you can get a rough idea of the amount of work it has to go through with, let's say, a range of -$1f to +$1f per step size.
On top of this, for each step combination it brute-forces multiple bytes ahead, allowing worsening moves, and slides a window across the whole chunk.
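The size of the step search can be checked with Python's `math.comb`. Ordered choices of 4 steps from 64 values give 64^4 combinations, while unique unordered sets are far fewer. For the -$1f..+$1f range the zero step is excluded here, which is an assumption.

```python
from math import comb

ordered = 64 ** 4                 # every ordered choice of 4 steps out of 64 values
unique_64 = comb(64, 4)           # same 64 values, ignoring order and repeats

# Step range -$1f..+$1f; excluding a zero step is an assumption made here.
steps = [s for s in range(-0x1F, 0x20) if s != 0]
unique_pm1f = comb(len(steps), 4)

print(ordered, unique_64, len(steps), unique_pm1f)
# 16777216 635376 62 557845
```

Even the reduced count of roughly 558,000 step sets per chunk has to be multiplied by a lookahead encode pass over the whole chunk for each set, which is where the extreme encoding cost comes from.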
Why another sample demo
-----------------------
Well... If I were to release V3 of SSDPCM at 8.4kHz, it would be pretty much like the demos I did previously at similar sample rates (e.g. easybananaflashrama, sampleblaster). I decided to opt for something more drastic...
As I mentioned previously, the decoder has been optimised more for interleaved decoding. The main killer in digi demos when it comes to CPU usage is the actual updating of samples via the non-maskable interrupt (NMI). At low sample rates this is not much of an issue, but at higher ones, say 16kHz, playing back a buffer via NMI and writing to the SID registers can easily occupy all the rastertime on its own.
The drastic goal I wanted was 16kHz sample playback with the screen on and insensitive to d011 (free scrolling is not an issue), plus real-time SSDPCM decoding on the fly, while still leaving CPU time free for other things.
This was achieved by interleaving the full decode outside the screen area, using a stable IRQ that writes to the SID once per line for the 16kHz playback. Within this interleaved code, more bytes are decoded than are written to the SID. After this, the NMI takes over the rest, but its start has to be precisely timed so that badlines do not interfere with it. It is also timed so that the NMI does not fire while the IRQ and its stabilisation run before the interleaved writes. Additional SID writes are done at precise positions in between the takeovers.
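Why one SID write per rasterline lands at roughly 16kHz follows from PAL C64 timing. The Python arithmetic below assumes a PAL machine (985,248 Hz CPU clock, 63 cycles per line, 312 lines per frame) and one sample per rasterline; on a badline the VIC-II steals roughly 40 of those 63 cycles, which is why the NMI start has to be timed around them.

```python
PAL_CPU_HZ = 985_248      # PAL C64 CPU clock
CYCLES_PER_LINE = 63      # PAL VIC-II (6569)
LINES_PER_FRAME = 312

line_rate_hz = PAL_CPU_HZ / CYCLES_PER_LINE           # one SID write per rasterline
cycles_per_frame = CYCLES_PER_LINE * LINES_PER_FRAME
frame_rate_hz = PAL_CPU_HZ / cycles_per_frame

print(round(line_rate_hz), cycles_per_frame, round(frame_rate_hz, 2))
# 15639 19656 50.12
```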
Remember: this demo not only plays back 16kHz 8-bit samples, it also depacks them in real time with little quality loss at a pack ratio of approximately 3.6:1, while scrolling full-screen images and streaming from disk.
Disk streaming
--------------
This demo uses Krill's unreleased loader, which has many neat speed improvements (a maximum load speed of over 7k per second) and can load/depack a file anywhere in memory without being restricted to the two-byte load address header.
Due to the 16kHz playback and decoding, there are approximately 100 lines "spare". I put this in quotes because the NMI still eats approximately 30 cycles on each of those lines. This leaves a conservative 55-60 rasterlines' worth of free time. Additional code such as the image scrolling reduces the spare rastertime further.
As a result, the loader achieves approximately 700 bytes per second.
There are 34 two-bar patterns, each packed to 3.5k, occupying 120k or so of disk space.
I opted for a modified slot-loading approach that loads multiple samples in one go, rather than loading at predefined timing positions.
Every 32 6-tick patterns, the loader loads the specified 5 samples into the specified slots while the current patterns/slots are played back. This has proven to be a good estimate that is guaranteed (I hope) to load all 5 samples before the next preload: 18k of data to be loaded per 26 seconds.
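The loading figures above can be cross-checked with a few lines of Python, taking 1k as 1024 bytes (an assumption) and the stated 3.5k per packed pattern, 5 samples per preload, about 700 bytes per second and a 26-second window.

```python
PACKED_PATTERN_KB = 3.5   # one packed two-bar pattern on disk
PATTERNS_ON_DISK = 34
SAMPLES_PER_PRELOAD = 5
LOAD_RATE_BPS = 700       # approximate loader speed during playback
PRELOAD_WINDOW_S = 26     # time between preloads

disk_kb = PATTERNS_ON_DISK * PACKED_PATTERN_KB
batch_bytes = SAMPLES_PER_PRELOAD * PACKED_PATTERN_KB * 1024   # 1k taken as 1024 bytes
load_time_s = batch_bytes / LOAD_RATE_BPS

print(disk_kb, batch_bytes, round(load_time_s, 1), PRELOAD_WINDOW_S - load_time_s > 0)
# 119.0 17920.0 25.6 True
```

Under these assumptions a 5-sample preload takes about 25.6 seconds, leaving well under a second of headroom in each 26-second window.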
Even with this method, bear in mind that the CPU usage of the demo limits the load speed, which in turn limits the number of unique samples that can be played in any order (to get around this, some preloading happens while repetitive samples are played back).
Required hardware/emulator setup
--------------------------------
For the best audio quality, please use a new SID (8580). The demo detects the SID model and supports the old SID (6581) too, but quality varies even more between individual SIDs. Digiboost hardware mods are not supported; with one active, audio quality is destroyed entirely.
If running on an emulator, Vice and Micro64 are fine. Again, ensure that digiboost is turned off and that resid is used as the SID engine. I have found that hoxs64 makes digitized audio using the 6-bit+ d418 method sound quite rough (even with digiboost turned off).
Finally, if you wish to hear the samples as they were intended, there is a digimax mode, which can be activated by holding down the space bar during boot (when the message to hold space appears at the bottom). Hold space until the screen turns black (and ensure a digimax is available at de00).