algorithm information 172 glöps

- general:
- level: user
- personal:
- first name: Naveed
- last name: Khugiani
- demo Commodore 64 easybananaflashrama by Algotech
Most of the questions you may want to ask are more than likely answered below. So let us start.
WHY EASYFLASH AND NOT D64 OR SINGLE FILER?
A good music video type of demo with digitized audio on the c64 can be done in a single filer or running from a floppy, but there are several restrictions.
Storage space is one of the major ones. This can be circumvented by using a tune that only has a few loops that are sequenced together and some select effects that do not rely on full blown FMV.
But as soon as you want a whole 3 minute track with vocals at high quality and thousands of unique video frames, doing this as a single filer or a disk based demo is no longer realistic
(an exception would be to require occasional disk swapping, but this would be cumbersome).
Secondly, access to data at any time was another requirement. There are many nice irq loaders, including those by Krill and Bitbreaker, and these can load in the region of 6-7k a second provided there is not much cpu usage in the demo while loading.
It could have been a possibility to load and depack data to a buffer while previous video frames were being played back, but as the cpu usage of the 25 frames per second video update and digital audio is constant most of the time, the loader would take a considerable amount of time to load and depack to the other buffer, leaving some delays. Remember the first point mentioned: "access to data at any time".
The video consists of many thousands of frames, each of which can be played back at any point without any delay whatsoever.
So overall, the reasons for it being easyflash are that I needed the storage space for thousands of frames and high quality full audio with vocals, and also required data on immediate demand. However, I did not want to go overboard with storage, so I limited it to a 1mb easyflash cart.
OK, EASYFLASH. NOW ALL YOU NEED TO DO IS JUST PLAY RAW SAMPLES AND LOAD FRAMES.
Wrong. 1mb may seem like a lot, but in order to fit thousands of frames and 3 minutes of digitized audio, both video and audio require realtime decompression on the fly in order to be packed to 1mb.
Secondly, the easyflash does not make the c64's cpu any more powerful, all code is still run on a c64. Easyflash is just there for the immediate data access and storage (in 16k banks) :-)
BAHH. THIS IS NOT A DEMO, ITS JUST AN ANIMATION PLAYER
Yes, it is an animation player, but one that does 25fps full screen video depack and decode WITH >8khz digitized audio depacking and playback via custom adpcm hybrid decode, all on a <1mhz cpu.
NOT GOOD. QUALITY NOWHERE NEAR AS NICE AS THE REU VIDEO PLAYBACK DEMOS
These REU demos would typically use 16mb of data with each frame using more than 16k.. at 25fps, this would equate to over 400k of data per second giving only maximum 40 seconds of playback for unique frames (and no digitized audio)
The quality is excellent, no doubt, but when the storage limit drops to only 1mb, there are many more unique frames, and there is also a full 3 minute digitized soundtrack, there has to be a compromise in quality.
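The REU figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes exactly 16k per frame (the text says "more than 16k", so the real playback time would be shorter) and binary units for "mb" and "k".

```python
# Assumed figures taken from the text: 16 MB of REU data, 16 KB per frame, 25 fps.
REU_BYTES = 16 * 1024 * 1024
FRAME_BYTES = 16 * 1024
FPS = 25

bytes_per_second = FRAME_BYTES * FPS        # data consumed per second of video
unique_frames = REU_BYTES // FRAME_BYTES    # frames that fit in the REU
playback_seconds = unique_frames / FPS      # upper bound on unique playback time

print(bytes_per_second // 1024, "KB/s,", unique_frames, "frames,", playback_seconds, "s")
```

With these assumptions the result is 400 KB per second and just under 41 seconds of unique frames, in line with the "maximum 40 seconds" stated above.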
TECH DETAILS OF THE DEMO PLEASE
Ok, lets start off with the audio.
The original track was over 3 minutes and would have taken up approximately 1.7mb at 8.4khz if unpacked. My goal was to squeeze this down into around 230k without losing much quality.
I started by splitting the whole song into four bar patterns and removing redundant data. The end result was 52 four bar patterns that can be resequenced to form back the entire song.
This shrank the file by 2:1, with the audio now taking approximately 850k.
I then packed each four bar pattern using my ssdpcm2-v2 audio encoder. This packs one second of raw 8.4khz samples (8400 bytes) to 2.2k, an overall pack rate of nearly 4:1. The whole tune is now in approximately 230k.
The ssdpcm2-v2 audio technique is an improvement over the ssdpcm2 method used in sampleblaster and algodreams in that it has double the lookahead bytes and uses a windowing system with brute force to generate the optimum waveshaping parameters per chunk.
The decoder on the c64 side uses codetables that translate compressed byte data to code segments, with each code segment shaping the waveform appropriately and conveniently pushing it to the stack ahead of time.
The NMI simply plays back the stack backwards, making sure that important stack data gets saved/restored per frame.
The playback method is using presetup d418 method which produces higher than 4bits output. It is strongly recommended to run this on a c64 with a new-sid. Any digiboost hacks on the sid will result in this demo producing garbage as sound.
For this reason, if you are also running it on an emulator, ensure digiboost is turned off and select newsid (and resid, not fastsid). I also have autodetect for old sid with relevant table used, but results will vary. (Audio will sound louder and more crackly)
Overall, the estimated cpu usage for the actual digitized audio playback and the ssdpcm2-v2 decode is around 50% of cpu per frame.
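As a rough check of the audio size figures quoted in this section, here is a minimal sketch. It assumes "1.7mb" means 1,700,000 bytes and "2.2k" means 2200 bytes; neither reading is confirmed by the text.

```python
RAW_BYTES_PER_SEC = 8400        # one second of raw 8-bit audio at 8.4khz
PACKED_BYTES_PER_SEC = 2200     # "2.2k" per second after ssdpcm2-v2 (assumed decimal)
RAW_TOTAL = 1_700_000           # "approximately 1.7mb" (assumed decimal)

duration = RAW_TOTAL / RAW_BYTES_PER_SEC                  # length of the track in seconds
after_dedup = RAW_TOTAL / 2                               # 2:1 from removing repeated patterns
pack_ratio = RAW_BYTES_PER_SEC / PACKED_BYTES_PER_SEC     # per-pattern pack rate
final_size = after_dedup / pack_ratio                     # estimated packed audio size

print(round(duration), "s,", round(pack_ratio, 2), ":1,", round(final_size), "bytes")
```

Under these assumptions the track is a little over 200 seconds, the pack rate is about 3.8:1 and the result lands in the 220-230k region, consistent with the numbers above.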
Now some information on the video.
All video in the demo is running at 25fps. In order for this to be achieved, a full frame needs to be read and decoded in two frames before switching over to the updated buffer.
This brings us to one of the main areas that I was not too happy with. The compression of frames is not optimal.
One solution to this (which would have removed the ability to access any frame at any point) would have been to depack chunks to ram and then depack the semi-packed data when finished, or to use delta encoding in the encoder so that only changes are plotted. (This would actually have saved cpu time in some cases.)
In order to get 25fps playback (and bearing in mind that more than half the cpu is used for the digitized replay and decode) there is little time to use other, more efficient packing methods, but 240 bytes per frame is fine as it is in this case.
Most of the video data is compressed using my updated TileX2 encoder which further shrinks CSAM frames to either 60 bytes, 240 bytes or 480 bytes. I decided to opt for the middle option and to also intermingle these with full csam frames in the demo (Internal updated build of CSAM which utilises DCT preconversion). As mentioned before, these semi-packed frames are directly written to the cart without any further packing.
The decode is straightforward: read a 240 byte frame from the cart, decode it to 4 tilelookups and plot to screen.
TileX2 mode does result in blocky output (in comparison to a standard csam frame). For this reason, I use d016/d011 shifting per frame to reduce the appearance of larger blocks. Yes, I am aware that it is a quick workaround, but it IS effective provided that it is run on either a real c64 or an emulator with the monitor at 50hz output.
Information on the core of the demo
The core of the demo is in the music sequence replayer where i have placed sync markers which trigger one of many video sequences (and video depack types).
When the trigger is activated, the decode of the video starts (after variable delay, also specified in sync mark). When finished, it falls back to ram image playback synced to the audio until another trigger is generated
CONTENT BREAKDOWN
1 2x2 font (2048 bytes)                                  002,048 bytes combined
3 codebooks (intro logo + in-demo, each 2048 bytes)      006,144 bytes combined
52 four bar packed samples (each packed to 4400 bytes)   239,616 bytes combined (padded to $1200 bytes)
34 gfx codebooks for video (each 2048 bytes)             069,632 bytes combined
30 tileluts for video (each 1024 bytes)                  030,720 bytes combined
1772 frames tilevqmode (each 240 bytes)                  453,632 bytes combined (padded to $0100 bytes)
215 frames csamdct (each 1000 bytes)                     220,160 bytes combined (padded to $0400 bytes)
code and misc (approx 16k)                               016,384 bytes combined
As you can see above, the 1mb has been well utilised. However, note that packing the actual crt image via zip or similar brings it down to around 700k. This is due to the framedata not being fully packed in the demo.
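The breakdown can be summed up to see how close it comes to the 1mb limit. The sketch below uses the padded sizes from the table; the dictionary keys are just labels chosen here, and 1mb is assumed to mean 1,048,576 bytes.

```python
CART_BYTES = 1024 * 1024   # assumed size of the 1mb easyflash cart

# (count, padded size in bytes) for each entry in the content breakdown
content = {
    "2x2 font":           (1,    0x0800),
    "codebooks":          (3,    0x0800),
    "four bar samples":   (52,   0x1200),
    "gfx codebooks":      (34,   0x0800),
    "tileluts":           (30,   0x0400),
    "tilevqmode frames":  (1772, 0x0100),
    "csamdct frames":     (215,  0x0400),
    "code and misc":      (1,    0x4000),
}

totals = {name: count * size for name, (count, size) in content.items()}
used = sum(totals.values())
free = CART_BYTES - used

for name, total in totals.items():
    print(f"{name:20s} {total:>9,d}")
print(f"{'used':20s} {used:>9,d}  free {free:,d}")
```

The per-entry products match the "bytes combined" column, and the total leaves roughly 10k of the cart unused.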
ACCESS TO DATA
In order to switch banks on the easyflash and actually access the data, $01 needs to be #$37. I am using config switching in easyflash banks to $8000-$9fff and $a000-$bfff. I make sure to set $01 back to #$35 after reading the data. This saves cycles on the nmi, which would otherwise run through the kernal (SEI, JMP ($0318)) on each update, eating up 7 cycles.
The result of this is that there is some variable delay in the nmi sample update (due to the $01 switching), but it is not noticeable.
DEVELOPMENT TIME
Nearly all of the development time for producing this demo was in the actual encoding and syncing of the demo. Millions of iterations per video sequence.
The actual demo code consists of only a ssdpcm2-v2 depacker and a tilex2/csam video decoder, and the sequences trigger the relevant video sections. So the actual code is an animation player that plays back packed audio and video. Of course, the highlights are full screen 25 frames per second video decompression and playback with realtime decode and playback of digitized audio.
HOW MUCH SPACE WOULD THIS DEMO TAKE IF IT WAS UNPACKED
Audio would take up 1.7mb
Gfx would be approximately 30mb
Hence the compression level in this demo is around 32:1. Taking into account the 200 second runtime of video/audio, it would be equiv to a 40kbs stream (5 kilobytes a second).
- isok, added on the 2018-02-17 00:04:11
- demo Commodore 64 KAOS 64 by Algotech
- KAOS 64
Summary
This single filer C64 demo is a remake of the classic KAOS part in the amiga budbrain megademo.
How it came about
After experimenting with a few other (more cpu consuming) audio compression routines for the c64, I decided to revisit the fast VQ decode method as I was planning on mixing and decoding multiple streams on this machine.
The original idea was to mix in two fixed size looped samples to give more "variety", but after starting to code the routine, it was not an issue to mix in 4 vq packed samples and still leave some spare cpu time.
I decided to adjust the player so that it can read in data similar to a mod (4 tracks, timeline, triggered samples). In a matter of time, I was able to play back 4 channels and trigger any vq sample in the timeline for each track individually.
Some quick adjustments later, I had separate volume control per sample, some other gating options, and playback of looped samples (chords etc).
Now, some sacrifices had to be made. Pitch adjustments would have been an easy possibility (via indexing incremental pitch tables and adding at the end to the low offset, as in the frodigi pitch adjustment method). However, I decided to opt for using VQ compression on all channels, which would allow me to use 4 times the samples in ram and also allow me to overcome the limitation of no pitch by using resampled data for notes.
Some modules consist of fixed pitch samples (such as drums, sample loops and precalculated chords). However, the main lead and bassline do not have too many note variations. If for example you take a bassline with 4 pitches, even with 4 separate samples (one per pitch), this would compress to the equivalent of a single instrument due to the vq compression. This, together with the compression of the fixed pitch samples, would generate a mod usually far smaller than the original.
I was going through a few modules and came across the Kaos mod (from budbrain megademo). This consisted of a few loops and fixed pitch samples and only a few different notes per sample for the bassline and lead instruments. It was a good candidate for testing. Once done, I had created a 1:1 conversion and decided to make a demo from it.
conversion process
The conversion process was mundane and time consuming and would have been considerably faster if I had automated this somewhat. However, as I had started the exercise, I decided to continue using the manual approach for the mod.
I went through each sample (from beginning to end) and checked whether different frequencies are used in the mod for that particular sample. Then, in an empty mod, I entered these samples with the separate frequencies to play one after another.
e.g. if sample 2 in the mod is played back at d#5 and c#5, enter d#5 and c#5 in a blank mod.
The sample chunk would then be saved as a 44100hz wav and imported into an audio editor, where the whole chunk was resampled to 8400hz. The individual notes would then be saved with a length that is a multiple of 168 bytes (equiv to a frame of playback at 8400hz). Each note would then be given a name that is easy to identify (e.g. sample2-d#5).
Then the process was repeated for all the samples used in the mod.
These samples are then converted to 64 intensities and combined and VQ packed with the relevant offset start and length data.
To recreate the mod using the new samples, the new samples would be inserted into the relevant sample slots using default sample rate (8400hz). Any different pitches have the relevant sample inserted with the required volume commands.
For other pitch bending based commands, segments of that sample were extracted and used.
This would then result in a full mod file using these remapped samples that can be played back using my custom vq mod player.
The patterns were split from 64 sections per track to 16, and then all redundant patterns removed, then remapped to the 16 sections. This saved significant ram space.
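The pattern splitting and removal of redundant patterns could be sketched roughly as follows. This is not the tool used for the demo (the conversion was done by hand); the function name and the row representation are assumptions made for illustration only.

```python
def split_and_dedup(patterns, rows_per_chunk=16):
    """Split each pattern into fixed-size chunks and keep each distinct chunk once.

    Returns the list of unique chunks and the order list that rebuilds the song.
    """
    unique = []   # distinct chunks, in first-seen order
    index = {}    # chunk -> position in `unique`
    order = []    # sequence of chunk numbers that recreates the original data
    for pattern in patterns:
        for start in range(0, len(pattern), rows_per_chunk):
            chunk = tuple(pattern[start:start + rows_per_chunk])
            if chunk not in index:
                index[chunk] = len(unique)
                unique.append(chunk)
            order.append(index[chunk])
    return unique, order


# Two hypothetical 64-row patterns for one track, built from repeated 16-row sections.
pattern_a = ["kick"] * 16 + ["snare"] * 16 + ["kick"] * 16 + ["hat"] * 16
pattern_b = ["hat"] * 16 + ["kick"] * 16 + ["bass"] * 32

unique, order = split_and_dedup([pattern_a, pattern_b])
rebuilt = [row for i in order for row in unique[i]]
print(len(unique), order)
```

In this toy example eight 16-row sections collapse to four stored chunks plus a short order list, which is where the ram saving comes from.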
The mixer / music driver
The core of the routine is a vq lookup reader, with each lookup having 256 possible values that point to 1k codetables that are precalculated in 4 levels of volume. The data for each track is mixed together, then pushed to the stack ahead of time (with the NMI reading back from the stack what was pushed earlier).
In order to save more on cpu time, the mixer is dynamic in nature. It can adjust itself using one of 16 routines based on whether any samples with non-zero volume are currently playing back in that frame. If all 4 channels are active in that frame, the typical cpu usage is around 80%, while it would be far lower if 3 or fewer channels are active.
I had done a few experiments on mixing quality and found that the signed 6bit addition method worked best for the fastest mixing. It may even have been possible to use signed 7bit without any overflow with careful prereduction of volume, but I decided to opt for the 6bit scaling to guarantee no overflow regardless of the loudness of each channel in the track.
The additional possible carry generated by this addition gives the output some additional "dither noise", which can aid in covering up some imperfections while giving its own imperfections as well :-)
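To illustrate why 6bit signed samples cannot overflow when four channels are added, here is a minimal sketch of a vq mixer. It is not the c64 routine: the 4-sample vector length (256 entries x 4 bytes = 1k), the reading of "16 routines" as one per combination of active channels, and all function names are assumptions.

```python
VOLUME_THIRDS = (0, 1, 2, 3)   # 0%, 33%, 66%, 100% expressed as numerators over 3
VECTOR_LEN = 4                 # assumption: each vq index expands to 4 samples


def scale(sample6, level):
    """Apply one of the four volume levels to a signed 6-bit sample (-32..31)."""
    return int(sample6 * VOLUME_THIRDS[level] / 3)   # truncates toward zero


def build_codetables(codebook):
    """Precalculate one table per volume level from a 256-entry codebook."""
    return [[tuple(scale(s, level) for s in vec) for vec in codebook]
            for level in range(4)]


def active_mask(levels):
    """Bit i is set when channel i has non-zero volume, giving 16 possible values."""
    return sum(1 << i for i, level in enumerate(levels) if level)


def mix_frame(codetables, indices, levels):
    """Mix one vq index per channel into VECTOR_LEN output samples."""
    out = [0] * VECTOR_LEN
    for index, level in zip(indices, levels):
        if level == 0:         # silent channel is skipped, like a cheaper routine
            continue
        for i, s in enumerate(codetables[level][index]):
            out[i] += s
    return out


# Worst case: every channel plays the extreme values at full volume.
codebook = [(-32, 31, 0, 5)] * 256
tables = build_codetables(codebook)
loudest = mix_frame(tables, [0, 0, 0, 0], [3, 3, 3, 3])
print(loudest, active_mask([3, 0, 1, 0]))
```

Four channels of -32..31 sum to at most -128..124, which still fits a signed 8bit value, so no clipping or range check is needed in the inner loop.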
the music driver
At the end of all this, a music driver had been produced which only requires an initialisation and then an update per frame in irq. The whole kaos track packed took up only 28k (from a 150k mod).
QUESTIONS
why is the quality so bad
For what it does in a very small size (28k) (from a 150k mod) and the tasks involved + running it in a single file on a c64, several compromises obviously had to be made. Some reasons for degradation as follows
original samples were not good in quality
Listen to the original samples in the KAOS track, and you will indeed hear that they are pretty rough sounding in nature
samples are resampled to 8400hz
Most of the samples are way above 8400hz in the original. These required resampling to 8400hz to fit the c64's playback rate in the mixer.
VQ packed samples
Don't expect a quick and fast solution that packs to a quarter of the size to work any miracles. (Even though my VQ encoder works in a similar way to CSAM and refines entries, it can't give lossless perception at this 4:1 rate.)
6bit scaled signed samples
Scaling the original -128 to +127 range down to -32 to +31 will no doubt give worse resolution, and worse still when volume adjustments (reduction in volume) are applied on top of this.
sample playback method
The method uses a direct write to d418 with sid pre-setup, which can give higher than 6bits output (instead of 4bits). This is not perfectly linear, hence there will be quality loss issues. Also, the mixing in the end produces 8bit sample output, which then needs to be translated back to the non-linear 6bit, giving further quality loss (and it can vary depending on sid revision).
Taking into account the above quality loss issues, it still manages to generate samples that are better than expected.
Why this, when there is THCM's player
THCM's decoder/player is specifically crafted to be fully 100% mod compatible with all commands supported. If you want full mod compatibility, then THCM's one is the choice. As samples are unpacked raw, these will either need to be resampled to low rates or to use small size samples.
My player's purpose is completely different. It is meant for playing back a very large amount of samples in a smaller ram space with "some" very basic mod support (e.g. placement of samples, some volume control, looping etc). If you want the possibility to play back 160k worth of samples in ram at one time and track these together, this is the player of choice. With some work, it is possible to recreate large mods nearly 1:1 while using less than a quarter of their original size.
As the cpu usage is proportional to the number of channels, the composer can limit the number of channels during sections of a demo where a certain part needs more cpu time, hence the dynamic mixing can be quite an advantage when used correctly.
tech specs and limitations
full 4 channel tracked samples with realtime vq decompression on the fly for each sample.
4 volume levels per sample (0%,33%,66%,100%)
gating (pulsing effects to zero at variable speed)
6bit samples mixed to 8bit and output via amplitude table
8400hz playback
limitations
no pitch adjustment (costly for cpu when VQ per channel is involved)
each sample 6bits (to allow for fast mixing). This can affect clarity in particular when low volume levels are in the sample or low volume is selected as a parameter for the sample.
effects
As the pop out effects can occur at any time in the song, I needed to take into account the worst case scenario for free cpu time.
It would have been trivial in cpu time to place images and then flip them when required via dd00/d018, but as ram space was rather limited for the other effects, they are plotted in place.
The "vectors" (precalculated csam) are packed using rle and depacked on the fly. The others are just some color adjustments and plotting. I wanted to make it as close to the amiga original as possible, hence no other extra effects.
- isok, added on the 2017-09-06 02:30:16
- demo Commodore 64 SSDPCM1 Super by Algotech
- Dodke, Don't worry, that is for later (At SSDPCM2 quality) with full music video :-)
- isok, added on the 2017-07-01 23:18:34
- demo Commodore 64 SSDPCM1 Super by Algotech
- Overview
This sample compression method is an enhancement of the ssdpcm1 routine that was originally used in the "channels" demo by algotech.
The original ssdpcm1 routine worked via the encoder choosing an optimum single step size for a chunk of a specific time frame (20ms or so).
The decoder would then shape the sampled output via a bit stream, adjusting the sample upwards or downwards by the step size.
This new implementation has the following enhancements
Encoder
16 bytes look ahead (in comparison with 1 value prediction). This allows locally worse decisions that result in higher overall quality per chunk.
There can now be two different step sizes in a given 20ms chunk, using only a very small amount of additional space. The encoder selects one of the two step sizes for every 8 bytes of a chunk and marks the choice in a single bit. Thus a whole 256 bytes of sample data only requires 4 extra bytes ((256/8)/8). This bypasses the limitation of a single step size per chunk, allowing whichever of the two optimum step sizes fits best to be used per 8 byte section. The result is higher quality for only around a 10% increase in file size (or less if using a lower sampling rate).
Decoder
Constant amount of CPU usage per frame using decrementing playback from stack with decoded data pushed to stack ahead of time. Saves a lot of headache and lower cpu usage.
The demo was originally just going to be a proof of concept with a text screen and audio playing back, but I had some (small) cpu time left for some simpler effects synced to the audio.
Some more tech details.
The original Axel-F soundtrack was 180 seconds in duration. By cutting up these samples to 4bar segments, I was able to reduce this to approximately 100 seconds of unique samples consisting of 50 four bar segments. As the sample rate is nearly 11khz (10800hz), unpacked, this would not only use over a megabyte of disk space, but it would also not be possible to stream it at this sample rate unpacked (unless having heavily looping repeating sections).
Each 4 bar sample would be 22032 bytes 8bit and consist of over a hundred 216 byte segments.
The 216 byte segments would have their unique step values and condensed into the following
2 bytes - two optimum step values for the chunk
4 bytes - holding 27bits control code on whether to use step1 or step2 for the given 8 byte sections.
27 bytes - bit stream (27*8) = 216 1bit values
Hence 216 bytes are packed to 33 bytes giving a size reduction of nearly 7:1
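A decoder for this 33 byte layout might look like the sketch below. Only the field sizes come from the text; the bit ordering, the meaning of a set bit (step up), the clamping to 0..255 and the `start` parameter carrying the last sample of the previous chunk are all assumptions, and the real c64 decoder uses generated code rather than a loop.

```python
CHUNK_SAMPLES = 216
GROUP = 8                              # samples sharing one control bit
GROUPS = CHUNK_SAMPLES // GROUP        # 27 groups per chunk
PACKED_SIZE = 2 + 4 + GROUPS           # 2 step bytes + 4 control bytes + 27 stream bytes


def decode_chunk(packed, start):
    """Decode one 33-byte chunk into 216 unsigned 8-bit samples."""
    assert len(packed) == PACKED_SIZE
    steps = packed[0], packed[1]                       # the two step values
    control = int.from_bytes(packed[2:6], "big")       # bit g selects the step for group g (assumed)
    stream = packed[6:]                                # one byte = 8 one-bit up/down decisions
    sample = start
    out = []
    for g in range(GROUPS):
        step = steps[(control >> g) & 1]
        for b in range(GROUP):
            up = (stream[g] >> (7 - b)) & 1            # MSB first (assumed)
            sample += step if up else -step
            sample = max(0, min(255, sample))          # clamp to 8 bits (assumed)
            out.append(sample)
    return out


# Hypothetical chunk: group 0 uses step 5 and rises, every other group uses step 2 and falls.
packed = bytes([2, 5]) + (1).to_bytes(4, "big") + bytes([0xFF] + [0x00] * 26)
samples = decode_chunk(packed, start=100)
print(PACKED_SIZE, round(CHUNK_SAMPLES / PACKED_SIZE, 2), samples[:10])
```

The ratio printed is about 6.5:1, matching the "nearly 7:1" above.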
Why 216 bytes per frame?
In order for this method to sound ok, it ideally needs to run at a higher sample rate (unlike ssdpcm2, it is not feasible to use it at lower sample rates). Using more frequent updates for the step sizes would increase the quality, however.
The total number of cycles per frame on PAL is 19656, and 19656/216 gives exactly an integer value (91), which allows the nmi to update at pretty much exactly the same interval throughout every frame. (Of course there are ways to constantly adjust dd04 per frame to approximate a fractional number, and this does work, but I opted for the other approach instead.)
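The choice of 216 can be verified with a short calculation. The sketch assumes the usual PAL figures of 63 cycles per raster line and 312 lines per frame, and a nominal 50 frames per second.

```python
CYCLES_PER_LINE = 63       # PAL c64
LINES_PER_FRAME = 312      # PAL c64
SAMPLES_PER_FRAME = 216
FRAMES_PER_SECOND = 50     # nominal PAL rate

cycles_per_frame = CYCLES_PER_LINE * LINES_PER_FRAME
timer_period, remainder = divmod(cycles_per_frame, SAMPLES_PER_FRAME)
sample_rate = SAMPLES_PER_FRAME * FRAMES_PER_SECOND

print(cycles_per_frame, timer_period, remainder, sample_rate)
```

A remainder of zero means a fixed timer period of 91 cycles stays locked to the frame, with no per-frame correction needed.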
disk streaming
The next issue was to get the streaming to work well..
The nmi uses 34 cycles (including jmp $dd0c to save a cycle). The translation table to write to d418 is within the nmi, which only saves/restores the accumulator. Moving the translation into the decoder would have gained a cycle per update, saving nearly 4 raster lines per frame (not much), but ram space was limited, as the decoder uses custom code per 8bit pattern for speed.
It would have been possible to save 5 cycles per update by using only the X or Y register in the nmi without saving/restoring it, but that register could then not be used outside the nmi.
Hence 34 cycles + some cycles latency on average would consume 38 cycles or so per update.
Does not sound like much, but 216 updates per frame x 38 cycles is 8208 cycles which equates to over 130 raster line usage just for the nmi update (playing from a page buffer)
Combine this with the decoder in irq that decompresses the sample in real-time and this results in even more usage.
This leaves not much cpu time for the loader and cuts down the approximate speed to around 2k a second or less.
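The nmi cost quoted above works out as follows. This sketch assumes the 38 cycle average from the text and the PAL figure of 63 cycles per raster line (19656 cycles per frame).

```python
UPDATES_PER_FRAME = 216
CYCLES_PER_UPDATE = 38         # 34 cycles plus average latency, as estimated in the text
CYCLES_PER_LINE = 63           # PAL c64
CYCLES_PER_FRAME = 19656       # PAL c64

nmi_cycles = UPDATES_PER_FRAME * CYCLES_PER_UPDATE
nmi_lines = nmi_cycles / CYCLES_PER_LINE
nmi_share = nmi_cycles / CYCLES_PER_FRAME

print(nmi_cycles, round(nmi_lines, 1), f"{nmi_share:.0%}")
```

That is a little over 130 raster lines, or roughly 42% of every frame, before the decoder in the irq has done any work.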
Now this would not be a problem if there were many repeated consecutive sample loops (that would allow the loader to load next segment in time) but there can be over half a dozen or more unique 4 bar patterns consecutively played back.
Now you may calculate that one second of 10800hz audio is packed to 1650 bytes, which leaves enough time to load consecutive segments.. wrong.
There are mechanical delays in the floppy drive. In particular, I am also using over 50 files on the disk, and 42 files can be cached this way using bitfire. Also, as samples can be loaded from different areas of the disk, further delays can be present. This was found out the hard way when noticing that Vice, Turbo Chameleon and 1541u2 do not implement the mechanical floppy delay.
To resolve the issue, I have implemented dual stage caching so that even if a trigger is made to load the next file while it is still loading the previous one, it will be able to do this just fine. However, I have given some "breathing" space in between every two chunk loads to ensure a fresh new batch of load requests will not result in an overrun. Utilising more buffer slots and reorganising the order of chunks to load also ensures that a specific amount of chunks are loaded in time.
Overall, together with the sequencing, a whole 180 second (3 minute) track has been condensed to less than 1k a second at an 11khz sample playback rate: approximately 164k of sample data on the disk.
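The disk figures follow from the chunk sizes given earlier (216 byte chunks packed to 33 bytes, 22032 bytes per four bar segment, 50 segments). A minimal check, assuming "k" means 1024 bytes:

```python
CHUNK_RAW = 216
CHUNK_PACKED = 33
SEGMENT_RAW = 22032        # one four bar segment, 8bit
SEGMENTS = 50
SAMPLE_RATE = 10800
TRACK_SECONDS = 180

chunks_per_segment = SEGMENT_RAW // CHUNK_RAW
segment_packed = chunks_per_segment * CHUNK_PACKED
raw_total = SEGMENTS * SEGMENT_RAW
disk_total = SEGMENTS * segment_packed
packed_per_audio_second = (SAMPLE_RATE // CHUNK_RAW) * CHUNK_PACKED
bytes_per_track_second = disk_total / TRACK_SECONDS

print(chunks_per_segment, segment_packed, raw_total, disk_total,
      packed_per_audio_second, round(bytes_per_track_second))
```

The packed stream needs 1650 bytes per second of unique audio, but thanks to the resequencing the disk only has to hold about 935 bytes per second of the finished track, around 164k in total.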
effects
Now onto the effects. There is not much ram left and I can only use a very small amount of cpu time per frame due to the ssdpcm decoder running per frame + nmi, hence what you see on the screen are just some visualizer based effects.
These are adjusted based on the actual sample playing and the various effects are synced based on the sample segments that are playing back.
Sample output
The actual writes to produce the sample via nmi use the Mahoney method, which allows a single write to d418 via the relevant amplitude table, producing higher than 4bits playback. There is autodetect in the demo that should adjust based on whether your c64 has an old sid or a new sid. Be warned, however, that the old sid results sound more distorted and varied. For higher quality, it is recommended to use a newsid revision (although this may also vary on different c64 machines). If using any emulator, make sure you select resid and 8580. It will not work on fastsid emulation.
- isok, added on the 2017-06-28 18:25:47
- demo SEGA Genesis/Mega Drive Overdrive 2 by Titan [web]
- Oozing with brilliance all the way throughout!
- rulez, added on the 2017-05-07 01:27:37
- demo Commodore 64 Frodigi 8 by Algotech
- Around 15-20% of cpu time is left. For slightly worse audio, with carry errors in the mixing (lower bits affected), it would save me an extra 40 rasterlines or so, giving around 25-30% free.
I needed that remaining rastertime to stream the data from disk (although the loader did a good job and was finished way before the next segment was needed).
- isok, added on the 2017-01-25 20:27:43
- demo Commodore 64 Frodigi 8 by Algotech
- ah, and one more thing. speech.. This encoder is not aimed at speech only. There are far more efficient methods of packing speech only based data than this one (even if it seems to do an ok job at it)
- isok, added on the 2017-01-25 19:28:50
- demo Commodore 64 Frodigi 8 by Algotech
- A few notes below (this is also in the actual demo text, doesn't anyone read first before commenting?). And yes, I know the quality is not good (although I would be hard pressed to consider even a lossless sample at 5200hz good either). If you want crystal clear, listen to my ssdpcm-mp method in the SSDPCM-MP amiga demo, or the Algodreams ssdpcm-mp part.
Excerpt from the demo itself..
"Q: why is the quality so horrible? a: this is ultra low bitrate and only uses 8 sines to recreate the whole audio. whole concept of the frodigi series is to attempt to pack audio at an extremely small size and then to gradually increase the quality with each series while still maintaining similar data size. quality is also fully dependant on source audio. speech can sound quite decent, but having chords, drums and other instruments together with the speech is more of a task with just the limited amount of sine channels. "
Some other notes..
In order to have the 8 channel mixing with screen on and streaming and semi depacking, many shortcuts needed to be taken on the c64 side, which in turn results in even worse quality (pitch adjustment method, non interpolation etc). Still, as an exercise, encode an mp3 file at 8kbs or so and compare it with this 4kbs implementation. (Not sure if mp3 can go any lower than 8kbs.) Again, this is not meant as a demo, but more as a proof of concept.
The PC encoder for non-c64 use produces far higher quality at the same bitrate (4kbs or lower, down to 1kbs). On the c64 side however, it is 5200hz for 8 channels, leaving some cpu time with the screen on for the loader and some accompanying text-displayers (much higher is possible if the screen is off and the main decode is in a closed loop, but that somewhat limits things further).
- isok, added on the 2017-01-25 19:27:07
- demo Commodore 64 Frodigi 8 by Algotech
- Yzi. There is no phase information in the encoded file. As long as sustain can be controlled per channel, it's not an issue at all (provided that you have individual frequency control per channel as well).
- isok, added on the 2017-01-24 22:16:12
- demo Commodore 64 Frodigi 8 by Algotech
- Should autodetect sid type. Results may vary quite a bit on old sid.
For Digimax mode (make sure it is set to $de00), hold down space while booting. All production note texts are included in the demo.
- isok, added on the 2017-01-24 01:26:46
account created on the 2011-01-25 21:59:22