Benchmarking SD card transfer speeds

ZeroByte · Post by **ZeroByte** » Fri Dec 17, 2021 6:11 pm

There's been some occasional discussion about what sort of data transfer speeds the X16 will have when loading from / saving to the SD card.

Since I have a patched Kernal image that is able to LOAD into HIRAM, I decided to do a speed test on it for a reasonably large file and see what sort of speed the LOAD routine has. Note that the HIRAM destination uses a slower loop that has to do a lot of bank flipping between the Kernal's working bank 0 and the destination bank, and watch for bank wraps, etc. Load into low RAM is faster.

Test conditions:

I used Box16's debugger to count clock cycles over the course of a few LOAD tests. The start point is when the actual Kernal LOAD routine is entered (after the jump from the API table) and the end is when the final RTS is completed. All file sizes include the PRG header, as the Kernal must process these bytes even if they aren't placed into RAM. The times do not include the time spent performing the SETNAM and SETLFS calls required to actually perform a load. The raw results are below after my conclusions.
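The derived figures in the results follow from the cycle counts by simple arithmetic. Here's a minimal sketch of that conversion, assuming an 8 MHz CPU clock and 60 frames per second. Neither value is stated above, but both are consistent with the posted numbers. The function name is just for illustration.

```python
CPU_HZ = 8_000_000   # assumed clock rate; consistent with 11,790,070 cycles = 1.4738 s
FPS = 60             # assumed frame rate; consistent with 0.365 s = 21.9 frames


def load_stats(file_bytes, cycles):
    """Convert the cycle count of one LOAD into time and throughput figures."""
    seconds = cycles / CPU_HZ
    frames = seconds * FPS
    bytes_per_sec = file_bytes / seconds
    bytes_per_frame = bytes_per_sec / FPS
    return seconds, frames, bytes_per_sec, bytes_per_frame


# 32K LOAD to HiRAM, using the cycle count from the results
seconds, frames, bps, bpf = load_stats(32_768, 2_920_036)
print(f"{seconds:.3f} s, {frames:.1f} frames, {bps:,.1f} B/s, {bpf:,.1f} B/frame")
```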

Observations:

* While these times exclude the time spent calling SETNAM and SETLFS, if you're thinking these numbers could be improved on by streaming from a single open file a few bytes at a time, the answer is no. As far as I know, there's no existing API call to the Kernal for performing a block read of a specified byte amount. (If I'm wrong, please let me know) LOAD does the entire file in one shot.

* The overhead of starting a LOAD is extremely significant. Smaller files achieve a much lower average throughput. For example, the large transfer into HiRAM achieves 1,533.8 bytes per frame, while loading a single file of 1,534 bytes into HiRAM achieves only about 50% of that speed (766.6 bytes per frame).
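One rough way to put a number on that startup overhead is to fit a straight line through the two HiRAM measurements. This sketch assumes a simple model (cycles = fixed overhead + cycles per byte × size) and an 8 MHz clock. The model is my assumption, not something taken from the Kernal source.

```python
CPU_HZ = 8_000_000  # assumed clock rate, consistent with the posted times

# (file size in bytes, measured cycles) for the two HiRAM tests
small = (1_534, 266_818)
large = (135_630, 11_790_070)

# Assumed model: cycles = overhead_cycles + cycles_per_byte * size
cycles_per_byte = (large[1] - small[1]) / (large[0] - small[0])
overhead_cycles = small[1] - cycles_per_byte * small[0]

print(f"~{cycles_per_byte:.1f} cycles/byte")
print(f"~{overhead_cycles:,.0f} cycles "
      f"({overhead_cycles / CPU_HZ * 1000:.1f} ms) fixed overhead per LOAD")

# Sanity check against the 32K HiRAM measurement (2,920,036 cycles)
predicted_32k = overhead_cycles + cycles_per_byte * 32_768
print(f"32K predicted: {predicted_32k:,.0f} cycles")
```

Under this model the fixed cost comes out at roughly 135,000 cycles, about one 60 Hz frame, and the 32K prediction lands within a couple of percent of the measured value.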

Conclusion:

If you're looking to stream audio or video data, this is the kind of performance you can expect to be working with. Keep in mind that these speeds are what is achieved when LOAD is able to consume nearly 100% of the CPU. If I were to repeat these benchmarks while banging on the keyboard to generate PS/2 transfers, these speeds would be MUCH worse. I would expect the transfer speeds for each LOAD size to scale linearly with the amount of CPU available to the routine, but that's only conjecture at this point. I'd need to come up with something more rigorous than banging the keyboard as a way to steal CPU cycles from LOAD.
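If the linear-scaling conjecture holds, a quick budget check for a stream looks like the sketch below. The 800 bytes/frame stream is a made-up example, and the function name is hypothetical.

```python
def required_cpu_fraction(stream_bytes_per_frame, measured_bytes_per_frame):
    """CPU share LOAD would need, assuming throughput scales linearly with CPU time."""
    return stream_bytes_per_frame / measured_bytes_per_frame


# Hypothetical stream needing 800 bytes per frame, fed by 32K LOADs into HiRAM
# (1,496 bytes/frame measured with LOAD owning nearly the whole CPU).
fraction = required_cpu_fraction(800, 1_496)
print(f"LOAD would need about {fraction:.0%} of the CPU")
```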

And for those still reading - here's the numbers:

Large LOAD to Hi RAM Test:

File Size: 135,630 bytes

Total CPU cycles = 11,790,070

Total time = 1.4738sec

Bytes/sec = 92,028.6

Bytes/frame = 1,533.8

Large LOAD to LoRAM Test:

(N/A)

32K LOAD to Hi RAM Test:

File Size: 32,768 bytes

Total CPU cycles = 2,920,036

Total time = 0.365sec (21.9 frames)

Bytes/sec = 89,774.2

Bytes/frame = 1,496

32K LOAD to Lo RAM Test:

File Size: 32,768 bytes

Total CPU cycles = 2,408,108

Total time = 0.301sec (18.06 frames)

Bytes/sec = 108,858.9

Bytes/frame = 1,814.3

Frame-sized LOAD to HiRAM Test:

File Size: 1,534 bytes

Total CPU cycles = 266,818

Total time = 0.0334sec (2 frames)

Bytes/sec = 45,993.9

Bytes/frame = 766.6

Frame-sized LOAD to LoRAM Test:

File Size: 1,534 bytes

Total CPU cycles = 243,435

Total time = 0.0304sec (1.826 frames)

Bytes/sec = 50,411.8

Bytes/frame = 840.2

Edmond D · Post by **Edmond D** » Fri Dec 17, 2021 6:35 pm

Thanks for investigating and publishing your results. I think this type of information will benefit all who are developing applications for the community.

ZeroByte · Post by **ZeroByte** » Fri Dec 17, 2021 6:43 pm

Here's a chart of the results:

I went ahead and benchmarked an 8KB file size as well, since that's a logical "block size" to use if breaking a file up into loadable chunks, i.e. one bank. (Technically, I should've used 8194 bytes to account for the PRG header, but that's going to be a negligible difference.)

desertfish · Post by **desertfish** » Fri Dec 17, 2021 7:23 pm

Nice work! Good to see that loading to banked HiRAM is not a lot slower.

ZeroByte · Post by **ZeroByte** » Fri Dec 17, 2021 7:39 pm

It was a lot closer than I expected it to be. Of course, that's a per-frame difference, and it adds up to human-scale "load times" when loading assets and so forth....

For the sake of interest, I decided to add VLOAD times to the benchmark, and found something interesting. VLOAD is MUCH slower - I think VLOAD uses the byte-by-byte method and not the burst transfer method. This makes sense, as the burst transfer method writes directly into an incrementing RAM buffer. It might be possible to patch DOS and fat32.s so that block transfers support VRAM as a target, but I'm not ready to tackle that right now.

The VLOAD speed results:

100K = 251 bytes/frame

32K = 249.9 bytes/frame

8K = 244.6 bytes/frame

1534 bytes = 215.9 bytes/frame

(updated the chart to include VLOAD comparison)
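To put the gap in perspective, here's a quick comparison using only the bytes/frame figures posted in this thread, for the two sizes where both a VLOAD and a HiRAM LOAD number appear in text.

```python
# bytes/frame from the posts above: (VLOAD, LOAD to HiRAM)
results = {
    "32K": (249.9, 1496.0),
    "1534 bytes": (215.9, 766.6),
}

# How many times faster LOAD to HiRAM is than VLOAD for each size
speedup = {size: hiram / vload for size, (vload, hiram) in results.items()}

for size, factor in speedup.items():
    print(f"{size}: LOAD to HiRAM is about {factor:.1f}x faster than VLOAD")
```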

ZeroByte · Post by **ZeroByte** » Fri Dec 17, 2021 8:01 pm

Here's a representation of the actual load times for various sizes/destinations.

I used 100KB as the "large" size in the HiRam LOAD so as to get an apples-to-apples comparison vs. VLOAD.

And yes, the times are in seconds - it takes roughly 7 seconds to load a full VRAM image using VLOAD directly. Note that you could load it into HIRAM and then blit that into VRAM faster than you can VLOAD it directly.
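A back-of-the-envelope check of the load-then-blit idea, as a sketch only. It assumes an 8 MHz clock and 60 fps, and uses the 135,630-byte HiRAM throughput as a stand-in for the 100K case. The blit cost of 20 cycles per byte is a hypothetical figure for a simple copy loop, not something I measured.

```python
CPU_HZ = 8_000_000          # assumed clock rate
FPS = 60                    # assumed frame rate
SIZE = 100 * 1024           # the 100K test size, in bytes

VLOAD_BPF = 251.0           # bytes/frame, from the VLOAD results
HIRAM_BPF = 1533.8          # bytes/frame from the large HiRAM test (stand-in for 100K)
BLIT_CYCLES_PER_BYTE = 20   # hypothetical copy-loop cost into VRAM, not measured

vload_s = SIZE / VLOAD_BPF / FPS
load_s = SIZE / HIRAM_BPF / FPS
blit_s = SIZE * BLIT_CYCLES_PER_BYTE / CPU_HZ

print(f"VLOAD directly:     {vload_s:.2f} s")
print(f"LOAD to HiRAM:      {load_s:.2f} s")
print(f"Blit HiRAM to VRAM: {blit_s:.2f} s (assumed cost)")
print(f"LOAD + blit total:  {load_s + blit_s:.2f} s")
```

Even if the real copy loop costs several times the assumed figure, the two-step route stays well ahead of VLOAD under these numbers.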

Obviously, this is all based on how the Kernal works under the hood, and VLOAD could be made faster if DOS is updated to support it as a block-transfer target.

ZeroByte · Post by **ZeroByte** » Fri Dec 17, 2021 10:05 pm

I suppose if one's unhappy with Kernal's performance, there's nothing stopping you from making your own fat32 routine and talking directly to the SD card using the SPI registers on VERA....