VGA mode-x latch copy (attn scali, trixter, yzi, wbc etc)
category: code [glöplog]
I've been working on some mid-90s tech.
To try out unchained mode techniques, I made some Kefrens bars -- but not the usual beam-racing kind: On each row, I use the latches to duplicate the previous row (80 bytes) then draw the head of the next bar over the copied row.
It runs at 60hz in $emulators, but on $realHW it's slow: rendering a frame takes the entire 16ms frame time, almost all of which is spent in the latch copy.
(240*80) * 2(read then write) * 60hz = ~2.3MB/sec needed for this effect. That's not possible on a PCI bus?
I expected VRAM access to be slow, but this seems so slow as to be unusable.
Am I missing something?
code: https://github.com/usrlocalben/dosfun/blob/master/app_kefrens_bars.cpp#L48
$emulators = { dosbox, dosbox-x, pcem }
$realHW = thinkpad 760ELD, p100mhz, trident 9320
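To put rough numbers on the above (a back-of-the-envelope sketch, not a measurement): the constant and function names are made up for illustration, and it assumes the copy really does eat the whole 1/60 s frame on the 100 MHz CPU.

```c
#include <assert.h>

enum { ROWS = 240, BYTES_PER_ROW = 80, FRAME_HZ = 60, CPU_MHZ = 100 };

/* CPU-side VRAM accesses per second: every copied byte is one read plus one write. */
static long accesses_per_second(void)
{
    return (long)ROWS * BYTES_PER_ROW * 2L * FRAME_HZ;
}

/* Time per copied byte (one read + one write), ASSUMING the copy
   consumes the entire frame. An assumption, not a measurement. */
static double ns_per_copied_byte(void)
{
    double frame_ns = 1e9 / FRAME_HZ;
    return frame_ns / ((double)ROWS * BYTES_PER_ROW);
}

/* The same figure expressed in CPU clocks at CPU_MHZ. */
static double clocks_per_copied_byte(void)
{
    return ns_per_copied_byte() * CPU_MHZ / 1000.0;
}

static void self_check(void)
{
    assert(accesses_per_second() == 2304000L);      /* ~2.3M accesses/sec */
    assert(ns_per_copied_byte() > 868.0);           /* roughly 868 ns per pair */
    assert(ns_per_copied_byte() < 869.0);
    assert(clocks_per_copied_byte() > 86.0);        /* roughly 87 clocks per pair */
    assert(clocks_per_copied_byte() < 87.0);
}
```

So if the copy fills the frame, each read+write pair is costing somewhere around 868 ns, or roughly 87 clocks at 100 MHz.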
hi, and welcome to MS-DOS playground! :)
first of all, VRAM reads are indeed awfully slow; they tend to be about 10-15 times slower than writes (writes are buffered in the CPU/chipset, so they can be issued faster and more frequently, while a read stalls the whole datapath until the data arrives at the CPU)
second, try not to use, or at least minimize, high-level (such as C++) code in inner loops; Watcom C++ supports inline _asm blocks, so you can write the loop at line 48 as:
Code:
_asm {
mov ecx, 80
mov esi, [prevPtr]
mov edi, [rowPtr]
rep movsb
}
it'll do exactly the same but without the loop overhead; MOVSB is fine since the VGA graphics controller trickery operates on bytes (keeping bitplanes in mind, i'm talking about the CPU side :), and since we allow only latched data to be written to VGA memory, the CPU-side data is ignored and doesn't matter.
third, kb's tinymod is pretty CPU-heavy and will probably bottleneck your demo on slower machines, so it's better to use native module replayers.
With rep movsb I achieved a few additional rows at 60hz, but not many.
I think you clarified the issue--in that it's a read-latency problem and not a bandwidth limit.
Thanks for that.
re: tinymod, you may have missed that I removed the PWM emulation, although there's still plenty of FPU ops that could be eliminated. (and I added interpolation which probably isn't worth it)
Does the GUS have a timer? How do you get e.g. a 50hz GUS player when the PIT is being used for a 60hz VBI irq? (yes, one could compose a module for 60hz, but never mind that)
GUS does have multiple timers:
Two 8-bit standard AdLib timers, incremented every 80 µs and 320 µs respectively
Or take a spare voice and use its sample or volume-ramping counter, which is incremented on every mixer cycle and thus gives more flexibility and precision
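For the 50hz case the arithmetic works out nicely on the 80 µs timer. A minimal sketch, assuming the usual AdLib-style behaviour where the counter counts up from the loaded value and fires when it overflows past 255 (so period = step * (256 - load)). The function name is made up for illustration, and no actual GUS register access is shown:

```c
#include <assert.h>

/* Returns the 8-bit load value that gives exactly period_us with a timer
   that increments every step_us, or -1 if no exact value exists.
   ASSUMPTION: the timer counts up from the load value and fires on
   overflow, i.e. period = step_us * (256 - load). */
static int timer_load_for_period(long period_us, int step_us)
{
    long counts;

    if (period_us % step_us != 0)
        return -1;                  /* not an exact multiple of the step */
    counts = period_us / step_us;
    if (counts < 1 || counts > 256)
        return -1;                  /* out of range for an 8-bit counter */
    return (int)(256 - counts);
}

static void self_check(void)
{
    /* 50hz = 20000 µs = 250 steps of 80 µs, so load 256 - 250 = 6 */
    assert(timer_load_for_period(20000L, 80) == 6);
    /* 20000 µs is 62.5 steps of 320 µs: no exact load value */
    assert(timer_load_for_period(20000L, 320) == -1);
}
```

60hz (16666.67 µs) is not a whole number of 80 µs steps, so there you'd have to round or fall back on the voice-counter trick.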
If you are targeting AT or newer machines, there is a second timer available:
The MC146818 CMOS timer, at port 70h-7Fh.
It can also generate a timer interrupt.
It is somewhat more limited than the 8253 PIT though: it only has a very limited set of dividers for the base clock:
http://www.futurlec.com/Datasheet/Motorola/MC146818.pdf
But still, you could set a reasonably high timer frequency and do an extra divider in your int handler, so you can still get approximately 50 Hz.
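A rough sketch of that extra divider. As far as I read the datasheet, rate-select values 3..15 give 32768 >> (RS - 1) Hz, so RS = 6 is 1024 Hz. The struct and function names below are hypothetical, and the actual port programming and IRQ8 handler are left out:

```c
#include <assert.h>

/* Periodic interrupt rate for rate-select values 3..15 (8192 Hz down to 2 Hz).
   Other values are not handled in this sketch and return 0. */
static unsigned rtc_rate_hz(unsigned rs)
{
    if (rs < 3 || rs > 15)
        return 0;
    return 32768u >> (rs - 1);
}

typedef struct {
    unsigned acc;        /* running remainder */
    unsigned irq_hz;     /* rate the handler is called at */
    unsigned target_hz;  /* rate we want ticks at */
} Divider;

/* Call once per RTC interrupt; returns 1 when a target-rate tick is due. */
static int divider_step(Divider *d)
{
    d->acc += d->target_hz;
    if (d->acc >= d->irq_hz) {
        d->acc -= d->irq_hz;
        return 1;
    }
    return 0;
}

static void self_check(void)
{
    Divider d = { 0, 0, 50 };
    unsigned i, ticks = 0;

    d.irq_hz = rtc_rate_hz(6);
    assert(d.irq_hz == 1024);
    for (i = 0; i < 1024; i++)       /* one second's worth of interrupts */
        ticks += (unsigned)divider_step(&d);
    assert(ticks == 50);             /* exactly 50 ticks per second on average */
}
```

The ticks come out 20 or 21 interrupts apart (1024/50 = 20.48), so the long-term rate is 50 Hz even though the individual intervals wobble by about a millisecond.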
You're better off drawing the stuff in a normal RAM buffer and then dumping that to VRAM in one go, if it's a PCI capable machine. Uninteresting bruteforce solution :(
@Marq yup the ram buffer is less interesting. it still needs to be split into planes so that the vram stores are dwords, so there's that.
60hz with that change. but yes--now it's boring.
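For reference, the split itself is cheap: in mode X, pixel x lives in plane (x & 3) at byte offset (x >> 2). A minimal sketch in plain C with no hardware access; the names are made up, and on the real machine each 80-byte plane row would go out as 20 dword stores after selecting the plane with the sequencer map mask:

```c
#include <assert.h>
#include <stdint.h>

enum { WIDTH = 320, PLANES = 4, PLANE_ROW_BYTES = WIDTH / PLANES };

/* Split one chunky (byte-per-pixel) row into four plane rows. */
static void split_row(const uint8_t *chunky,
                      uint8_t planar[PLANES][PLANE_ROW_BYTES])
{
    int x;
    for (x = 0; x < WIDTH; x++)
        planar[x & 3][x >> 2] = chunky[x];
}

static void self_check(void)
{
    static uint8_t chunky[WIDTH];
    static uint8_t planar[PLANES][PLANE_ROW_BYTES];
    int x;

    for (x = 0; x < WIDTH; x++)
        chunky[x] = (uint8_t)(x & 0xFF);
    split_row(chunky, planar);

    assert(planar[1][0] == 1);    /* pixel 1   -> plane 1, offset 0  */
    assert(planar[0][1] == 4);    /* pixel 4   -> plane 0, offset 1  */
    assert(planar[3][79] == 63);  /* pixel 319 -> plane 3, offset 79 */
}
```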
You can use the timer interrupt and GUS for any BPM tempo, even together with a timer-based VBI. Here's an example code snippet. Call update_music() from your timer interrupt, and give the number of PIT clock cycles elapsed since the last update in the cycles_elapsed_since_last_update parameter.
Code:
static volatile int music_time_accumulator = 0; // this is in PIT cycles (1.19 MHz)
static volatile int music_tick_duration = 23863; // 50 Hz, default for Amiga MODs if BPM tempo is not specified
static volatile int music_ticks = 0; // music time in music ticks (e.g. 50 ticks per second)

// Call this routine from an interrupt handler as often as you like.
// A new music tick will be played whenever it's time to do so.
void update_music(int cycles_elapsed_since_last_update)
{
    // play music if it's time to play another tick
    music_time_accumulator += cycles_elapsed_since_last_update;
    if (music_time_accumulator >= music_tick_duration)
    {
        music_time_accumulator -= music_tick_duration;
        pt_play(); // <-- music playroutine, plays stuff on e.g. GUS
        // pt_play() may have changed BPM tempo, so update tick duration.
        if (pt_bpmtempo > 0) {
            music_tick_duration = ((1193181*5)/((int)pt_bpmtempo*2));
        }
        music_ticks++;
    }
}
If your timer interrupt rate is slower than the BPM tempo "CIA timer" rate, then you should change the if statement to a while loop, but I recommend having timer interrupts and an update_music() call at least twice for every PC screen refresh, so you won't have that problem. Normal PC video modes are either 60 Hz or 70 Hz, so if you do a second interrupt in the middle of the screen, you'll be updating the music at 120 or 140 Hz, which will be perfectly OK.
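The while-loop variant could look roughly like this. It's a sketch only: play_tick_stub() is a hypothetical stand-in for pt_play(), the function is renamed to keep it apart from the snippet above, and the BPM tempo update is left out to keep it short:

```c
#include <assert.h>

static int music_time_accumulator = 0;   /* PIT cycles (1.19 MHz) */
static int music_tick_duration = 23863;  /* 50 Hz */
static int music_ticks = 0;

/* Hypothetical stand-in for the real playroutine (pt_play). */
static void play_tick_stub(void)
{
}

/* Like update_music(), but plays every tick that has become due, so it
   also keeps up when the interrupt rate is slower than the tick rate. */
static void update_music_catchup(int cycles_elapsed_since_last_update)
{
    music_time_accumulator += cycles_elapsed_since_last_update;
    while (music_time_accumulator >= music_tick_duration) {
        music_time_accumulator -= music_tick_duration;
        play_tick_stub();
        music_ticks++;
    }
}

static void self_check(void)
{
    /* 65536 PIT cycles is the default ~18.2 Hz timer period */
    update_music_catchup(65536);
    assert(music_ticks == 2);
    assert(music_time_accumulator == 17810);
    update_music_catchup(65536);
    assert(music_ticks == 5);
    assert(music_time_accumulator == 11757);
}
```

Note that the caught-up ticks are played back to back in one interrupt, so this keeps the song position right but not the timing of individual ticks.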
It seems like one wouldn't need to be extraordinarily gifted to notice the jitter from that method, esp. in arpeggios, retrigs or tempo-swing, with 50hz ticks on a 60hz interrupt. maybe at 120hz it could be ok.
I could render a wav with that timing and hear for myself.
I'm satisfied with the sbpro/sb16 for now (and I sold my GUS on ebay a long time ago)
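A quick way to get a feel for the jitter without rendering a wav: count how many interrupts pass between consecutive music ticks when the accumulator method is driven at a fixed rate. This is a simulation sketch with made-up names, assuming a 50hz tick (23863 PIT cycles) and whole PIT cycles per interrupt:

```c
#include <assert.h>

typedef struct {
    int min_gap;  /* fewest interrupts between two consecutive ticks */
    int max_gap;  /* most interrupts between two consecutive ticks */
} GapRange;

/* Run the accumulator method for a number of interrupts and record the
   spacing between ticks, measured in interrupts. */
static GapRange tick_gaps(int cycles_per_update, int tick_duration, int updates)
{
    GapRange r = { 0, 0 };
    int acc = 0, since_last = 0, seen_first = 0, i;

    for (i = 0; i < updates; i++) {
        acc += cycles_per_update;
        since_last++;
        if (acc >= tick_duration) {
            acc -= tick_duration;
            if (seen_first) {
                if (r.min_gap == 0 || since_last < r.min_gap)
                    r.min_gap = since_last;
                if (since_last > r.max_gap)
                    r.max_gap = since_last;
            }
            seen_first = 1;
            since_last = 0;
        }
    }
    return r;
}

static void self_check(void)
{
    GapRange at60 = tick_gaps(1193181 / 60, 23863, 600);
    GapRange at120 = tick_gaps(1193181 / 120, 23863, 1200);

    assert(at60.min_gap == 1 && at60.max_gap == 2);
    assert(at120.min_gap == 2 && at120.max_gap == 3);
}
```

In this model the ticks land 1 or 2 interrupts apart at 60hz (about 16.7ms or 33.3ms instead of 20ms), and 2 or 3 interrupts apart at 120hz (about 16.7ms or 25ms).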
Oh, so you weren't actually going to do anything with the information anyway.
Quote:
Code:
_asm {
mov ecx, 80
mov esi, [prevPtr]
mov edi, [rowPtr]
rep movsb
}
No, that should be:
mov ecx,20
...
rep movsd
I know the bus is still the overall limiting factor, but at least use 386 instructions on a 386...
@trixter
If I understand correctly, the latches are only 8 bits wide (per plane) so that would effectively be:
$latch = vram@0
$latch = vram@1
$latch = vram@2
$latch = vram@3
vram@0 = $latch (==vram@3)
vram@1 = $latch (==vram@3)
vram@2 = $latch (==vram@3)
vram@3 = $latch (==vram@3)
or maybe some other undefined behavior.
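Here's that hypothesis as a toy model in C (no real hardware involved). Assumptions: one 8-bit latch per plane, every byte read reloads all four latches, writes store the latches (write mode 1), and the bus splits a dword access into four byte reads followed by four byte writes. Real chipsets may behave differently, and all names are made up:

```c
#include <assert.h>
#include <stdint.h>

enum { PLANES = 4, VRAM_BYTES = 16 };

static uint8_t vram[PLANES][VRAM_BYTES]; /* one byte per plane at each offset */
static uint8_t latch[PLANES];            /* one 8-bit latch per plane */

/* A CPU byte read loads all four latches from the same offset. */
static void vga_read(int off)
{
    int p;
    for (p = 0; p < PLANES; p++)
        latch[p] = vram[p][off];
}

/* A CPU byte write in write mode 1 stores the latches; CPU data is ignored. */
static void vga_write_latched(int off)
{
    int p;
    for (p = 0; p < PLANES; p++)
        vram[p][off] = latch[p];
}

/* Byte-wise copy (the rep movsb case): read, write, read, write... */
static void copy_bytes(int dst, int src, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        vga_read(src + i);
        vga_write_latched(dst + i);
    }
}

/* ASSUMED dword behaviour: four byte reads, then four byte writes. */
static void copy_dword_split(int dst, int src)
{
    int i;
    for (i = 0; i < 4; i++)
        vga_read(src + i);
    for (i = 0; i < 4; i++)
        vga_write_latched(dst + i);
}

static void self_check(void)
{
    int p, i;

    for (p = 0; p < PLANES; p++)
        for (i = 0; i < 4; i++)
            vram[p][i] = (uint8_t)(p * 16 + i);

    copy_bytes(4, 0, 4);
    assert(vram[2][5] == 2 * 16 + 1);       /* faithful copy */

    copy_dword_split(8, 0);
    for (i = 0; i < 4; i++)
        assert(vram[0][8 + i] == 3);        /* all four are copies of offset 3 */
    assert(vram[2][9] == 2 * 16 + 3);
}
```

Under this model the byte-wise copy is exact, while the dword copy smears offset 3 across all four destination bytes.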
I think it depends on the VGA chipset used, but I'm reasonably sure that at least some of them will operate the same with rep movsb, movsw and movsd.
movsd/stosd produce the same results (i.e. broken) on my trident hw, and dosbox.
did you (obvious one, but hey) adjust the rep count accordingly?