Commander X16

Posted: **Fri Jan 15, 2021 2:55 am**

I was curious how fast the Commander can actually read files from disk, so I fired up the emulator and did some quick tests.

I wrote a very simple assembly routine that lives in Golden RAM (the section at $400), and timed it to see how fast it ran.

With a 32K file, it takes 155 ticks, or 2.58 seconds. That makes file read throughput 12.7KB/s.
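For anyone who wants to check the arithmetic, here is a minimal Python sketch of the conversion. It assumes the timer ticks 60 times per second and that "KB" means 1000 bytes, which is what the 12.7 figure works out to. The function name is only illustrative.

```python
TICKS_PER_SECOND = 60  # assumption: the jiffy timer advances 60 times per second

def throughput_kb_per_s(size_bytes, ticks):
    """Return (seconds, KB/s) for a timed read, with KB taken as 1000 bytes."""
    seconds = ticks / TICKS_PER_SECOND
    return seconds, size_bytes / seconds / 1000

seconds, rate = throughput_kb_per_s(32 * 1024, 155)
print(f"{seconds:.2f} s, {rate:.1f} KB/s")  # 2.58 s, 12.7 KB/s
```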

What does that mean, in terms of real numbers?

You can fill the BASIC program area ($800-$9EFF) in 3 seconds

An 8K bank will load in 0.64 seconds

You can load all 512KB of banked memory in about 41 seconds.
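These load times follow from scaling the 32K measurement by region size. A rough sketch, assuming the read rate stays constant regardless of file size (which the test above does not establish); the dictionary keys and function name are illustrative:

```python
# Bytes per second from the 32K test: 32768 bytes in 155 ticks at 60 ticks/s.
RATE = 32 * 1024 / (155 / 60)

def load_seconds(size_bytes, rate=RATE):
    """Estimated load time, assuming throughput does not vary with size."""
    return size_bytes / rate

regions = {
    "BASIC program area ($0800-$9EFF)": 0x9EFF - 0x0800 + 1,
    "one 8K bank": 8 * 1024,
    "512KB of banked RAM": 512 * 1024,
}
for name, size in regions.items():
    print(f"{name}: {size} bytes, {load_seconds(size):.2f} s")
```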

This is far less than the CPU's theoretical maximum throughput, and it averages to roughly 125 machine language instructions per byte read. I'm not sure if the time spent is just due to the FAT32 driver, or if there are some delays in the emulated hardware. That's something else to look at.
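A rough sketch of where an instructions-per-byte figure like this can come from. It assumes the emulated CPU runs at the X16's 8 MHz and an average of about 5 cycles per instruction; that average is a guess chosen for illustration, not a measured value.

```python
CPU_HZ = 8_000_000          # assumption: emulated 65C02 clocked at 8 MHz
CYCLES_PER_INSTRUCTION = 5  # assumption: rough average, not measured

def instructions_per_byte(bytes_per_second):
    """Estimate how many instructions execute for each byte read."""
    cycles_per_byte = CPU_HZ / bytes_per_second
    return cycles_per_byte / CYCLES_PER_INSTRUCTION

print(round(instructions_per_byte(12_700)))  # 126
```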

In the meantime, the next step would be to evaluate the speed of popular decompression routines. Exomizer and PuCrunch seem to be the most common 6502 compression systems right now.

** Edit: this has caused a little confusion, since I revised these numbers. The original result was around 9.8KB/s, and used the code attached. I got a marginal increase in performance by skipping the READST KERNAL routine and reading $286 directly. The problem here is that $286 is not frozen and could change.

So I'm going to request (or supply) a small change to the KERNAL to return the current address of the ST variable, so we can query it directly.

chkin.asm

Posted: **Fri Jan 15, 2021 9:29 am**

In the file assembler topic we had some peculiar findings regarding I/O speed as well. To me, it seems that there is something going on in the emulator that skews the results.

Posted: **Fri Jan 15, 2021 9:57 am**

Huh. I was looking for that conversation and didn’t realize it was part of the native assembler conversation.

I should probably look at the code in the emulator and see if it’s adding a delay... I had that experience with a Tandy 100 emulator a few months ago.

Posted: **Fri Jan 15, 2021 6:15 pm**

The emulator shouldn't be adding any delays to file I/O. If you're timing wall-time, you might be seeing efficiency problems in the emulator's implementation, but I'm not sure why they'd go away in the interrupt handler. The interrupt handler looks no different to the emulator code; it's still just running a loop to process CPU instructions and update the state of various emulated subcomponents.

I'm guessing that at least part of the explanation is that doing loads in the interrupt handler means that the loads aren't being interrupted on every vsync, when the VERA signals that a new frame has occurred. 4 second load = 240 interruptions from VSYNC, during which the kernal does its ordinary per-frame stuff. If you want to make sure a load outside of the interrupt handler isn't interrupted, insert an SEI instruction before starting it. Just make sure to also have a CLI instruction afterwards.

Posted: **Fri Jan 15, 2021 6:21 pm**

The weird thing is that doing SEI + CLI outside the interrupt handler doesn't save much time. It's still a mystery to me.

Posted: **Fri Jan 15, 2021 9:17 pm**

3 hours ago, StephenHorn said:

The emulator shouldn't be adding any delays to file I/O. If you're timing wall-time, you might be seeing efficiency problems in the emulator's implementation, but I'm not sure why they'd go away in the interrupt handler. The interrupt handler looks no different to the emulator code; it's still just running a loop to process CPU instructions and update the state of various emulated subcomponents.

I'm guessing that at least part of the explanation is that doing loads in the interrupt handler means that the loads aren't being interrupted on every vsync, when the VERA signals that a new frame has occurred. 4 second load = 240 interruptions from VSYNC, during which the kernal does its ordinary per-frame stuff. If you want to make sure a load outside of the interrupt handler isn't interrupted, insert an SEI instruction before starting it. Just make sure to also have a CLI instruction afterwards.

I'll give that a try right now (although it sounds like @stef

Posted: **Fri Jan 15, 2021 9:18 pm**

Re compression. Doesn't our nice Kernal have some shiny new decompression routines built-in?

Posted: **Fri Jan 15, 2021 9:35 pm**

13 minutes ago, rje said:

Re compression. Doesn't our nice Kernal have some shiny new decompression routines built-in?

I saw something about that. It's worth investigating.

So here's the thing: the 9.8KB/s I measured means an average of 160 or so instructions per byte read. If the compression takes, say, 80 instructions per byte, then it's going to need to compress at 2:1 or better to be worth the effort. I think this will all come down to performance testing (and, of course, a determination of whether the emulator is behaving accurately.)
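The break-even point can be worked out with a simple cost model. This is only a sketch: it assumes the decompression cost is counted per decompressed (output) byte and that read cost scales linearly with the number of bytes read. The function name is illustrative.

```python
def breakeven_ratio(read_cost, decompress_cost):
    """Smallest compression ratio (original size / compressed size) at which
    loading compressed data is no slower than loading it raw.

    Costs are in instructions per byte.
    Raw load:        read_cost per output byte
    Compressed load: read_cost / ratio + decompress_cost per output byte
    """
    if decompress_cost >= read_cost:
        return float("inf")  # decompression alone costs more than a raw read
    return read_cost / (read_cost - decompress_cost)

print(breakeven_ratio(160, 80))  # 2.0
```

With these numbers, a cheaper decompressor lowers the bar quickly: at 40 instructions per byte the required ratio drops to about 1.33:1.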

Posted: **Fri Jan 15, 2021 10:19 pm**

4 hours ago, Stefan said:

The weird thing is that doing SEI + CLI outside the interrupt handler doesn't save much time. It's still a mystery to me.

Same. I got around 2.5-2.7 seconds with interrupts disabled. It was hard to measure accurately, because without interrupts, TI doesn't get updated.

With interrupts enabled, I get 2.58 seconds. So the overhead of the interrupt handler seems minimal.

Although I'm going to have to revise my earlier numbers. With the simplest possible loop, I'm getting 32K in 2.58 seconds, which adds up to 12.7KB/s. I'll go back and revise my initial post.

That's still an order of magnitude less than the theoretical maximum, but 12KB/s is better than my original result. I can't explain the difference, though, and that bothers me.

** I figured out the difference. My first test used the READST API. The second time around, I just did an LDA FILE_STATUS, which reads directly from the system RAM location. This is unsafe, as the location may change.

What surprises me is that READST is that inefficient. It added something like 30% to the overall runtime of the routine.

Posted: **Fri Jan 15, 2021 10:48 pm**

So I stepped through the KERNAL code. There's no mystery here, after all... the FAT driver is just a lot of code. I kind of lost track at 100+ steps in the KERNAL.

I think performance can be improved by buffering data to RAM, which should bring individual character reads down to about 10 instructions. Obviously there's time involved in buffering each block, but since most of the code would be setting up a sector read, it's much more efficient to read a whole sector into memory at once.
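The buffering idea can be illustrated in Python, leaving the 6502 details aside. This is a sketch, not the KERNAL's actual design: `read_sector` is a hypothetical callable standing in for the expensive driver call, and the 512-byte sector size is an assumption.

```python
import io

SECTOR_SIZE = 512  # assumption: typical SD card sector size

class BufferedByteReader:
    """Serve single bytes from a sector-sized buffer, refilling only when empty."""

    def __init__(self, read_sector):
        self._read_sector = read_sector  # hypothetical: returns up to one sector of bytes
        self._buf = b""
        self._pos = 0
        self.refills = 0  # how many times the expensive path was taken

    def read_byte(self):
        """Return the next byte as an int, or None at end of file."""
        if self._pos >= len(self._buf):
            # Slow path: one full sector read, paid once per SECTOR_SIZE bytes.
            self._buf = self._read_sector()
            self._pos = 0
            if not self._buf:
                return None
            self.refills += 1
        # Fast path: an index and an increment.
        byte = self._buf[self._pos]
        self._pos += 1
        return byte

source = io.BytesIO(bytes(range(256)) * 8)  # 2048 bytes of stand-in file data
reader = BufferedByteReader(lambda: source.read(SECTOR_SIZE))
count = 0
while reader.read_byte() is not None:
    count += 1
print(count, reader.refills)  # 2048 4
```

The point of the structure is that the per-sector setup cost is spread over every byte in the sector, so the common case stays short.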

File I/O performance
