File I/O performance

StephenHorn
Posts: 565
Joined: Tue Apr 28, 2020 12:00 am

Post by StephenHorn »



33 minutes ago, TomXP411 said:




** I figured out the difference. My first test used the READST API. The second time around, I just did an LDA FILE_STATUS, which reads directly from the system RAM location. This is unsafe, as the location may change. 



What surprises me is that READST is that inefficient. It added something like 30% to the overall runtime of the routine. 



 


4 minutes ago, TomXP411 said:




So I stepped through the KERNAL code. It's just a lot of code. I kind of lost track at 100+ steps in the KERNAL.



Woah, how'd that happen? I'm digging through disassembled code in the emulator, and if you're starting from ROM bank 0, that should be:


JSR $FFB7 ; 6 cycles, calling READST
JMP $D646 ; 3 cycles, READST jumps into the kernal implementation
LDA $0286 ; 4 cycles
ORA $0286 ; 4 cycles
STA $0286 ; 4 cycles
RTS       ; 6 cycles, returning to the calling code



Edit: If you're not starting from ROM bank 0, say you're in ROM bank 4 (BASIC), there may be hoops it jumps through to trampoline into the ROM implementation. Or there may be another implementation altogether that assumes we're running a BASIC command...

Developer for Box16, the other X16 emulator. (Box16 on GitHub)
I also accept pull requests for x16emu, the official X16 emulator. (x16-emulator on GitHub)
TomXP411
Posts: 1785
Joined: Tue May 19, 2020 8:49 pm

Post by TomXP411 »



2 minutes ago, StephenHorn said:




Woah, how'd that happen? I'm digging through disassembled code in the emulator, and if you're starting from ROM bank 0, that should be



Crud. Now I need to test both methods again. I am thinking I might have recorded the result wrong on my first test. 

I do have to wonder at what that code is doing, though... why OR and save the value back? ORing a number with itself is just... the same number. 

 


Post by StephenHorn »



4 minutes ago, TomXP411 said:




Crud. Now I need to test both methods again. I am thinking I might have recorded the result wrong on my first test. 



 



 



An easy thing to miss in ASM projects is that starting your program with RUN and a SYS command still leaves you in ROM bank 4, since BASIC is what runs the command. So an ASM-based program probably wants to manually set its ROM bank to 0 fairly early in execution. This also makes the interrupt handler much, much lighter, since it no longer relies on BASIC code to catch the interrupt and then trampoline into proper kernal handling.

That said, I have no idea whether it's possible to exit gracefully after that by resetting the ROM bank to 4 before your program's final RTS instruction.


Post by TomXP411 »


Thanks, I'll keep that one in mind. This particular test is bookended in BASIC code to manage the timing.

Anyway, I looked at the original Commodore ROM, and it's the same. However, there is an additional label there, which updates the status byte. So it looks to me like someone decided to save some ROM space by cramming the read and update code together. 

https://www.pagetable.com/c64ref/c64disasm/#FE07

And yes, it really does add that much time: 208 jiffies using JSR READST versus 155 jiffies using LDA $0286, so about 34% longer.

I don't think it's worth worrying about for small files, but when you're loading 100+KB into banked RAM, I think that will make a noticeable difference. So it's something to bear in mind. 

On the bright side, I don't think there's actually anything to lose by reading past EOF... you'll just get back nulls, and if you expect a certain block size in your file, that's not an issue. So calling READST once every block, rather than once every byte, is going to save some CPU time.
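
A minimal sketch of the once-per-block idea, not the code from the test above. It assumes the file is already open and selected for input with CHKIN, and that bytes read past EOF are harmless, as described. BUFFER and COUNT are made-up locations for illustration; CHRIN ($FFCF) and READST ($FFB7) are the standard KERNAL entry points. The index is kept in memory so the loop does not depend on which registers CHRIN preserves.

```asm
CHRIN  = $FFCF        ; KERNAL: read one byte from the current input channel
READST = $FFB7        ; KERNAL: return the I/O status byte in A
BUFFER = $6000        ; hypothetical 256-byte destination buffer
COUNT  = $22          ; hypothetical zero-page scratch byte

; Read one 256-byte block, then check the status once.
; Returns with A = status (zero means no EOF or error was flagged).
read_block:
    stz COUNT         ; index 0..255, wraps back to 0 after 256 bytes
rb_loop:
    jsr CHRIN         ; A = next byte from the file
    ldy COUNT
    sta BUFFER,y
    inc COUNT
    bne rb_loop       ; keep going until the index wraps
    jmp READST        ; tail call: one status check per block
```

The caller would repeat read_block until it returns a nonzero status, which costs one READST call per 256 bytes instead of one per byte.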

 


Post by StephenHorn »


Ah, I was wondering what was up with the ORA and STA before getting back to the RTS. That makes sense, and allows them to save a byte from SETMSG as well, since that too runs straight through READST and UDST. Clever. The academic in me wonders if they secretly depend on that execution path anywhere in the kernal. The horrified engineer in me is wondering if those dependencies are documented.


Post by TomXP411 »



33 minutes ago, StephenHorn said:




Ah, I was wondering what was up with the ORA and STA before getting back to the RTS. That makes sense, and allows them to save a byte from SETMSG as well, since that too runs straight through READST and UDST. Clever. The academic in me wonders if they secretly depend on that execution path anywhere in the kernal. The horrified engineer in me is wondering if those dependencies are documented.



It's a run-on from SETMSG, which you wouldn't want to change. So we'd need a new API entry.

However, I'm thinking the smart thing is to avoid the vector table and ROM completely and just read the STATUS byte directly from RAM. Of course, to do that, we need to know the address of STATUS. So I'm thinking the best idea is a new API that gives us a pointer to STATUS. 

Get Status Pointer, or GETSTPTR

This would take X as input and store the address of the STATUS variable in zero page, at bytes X and X+1:

GETSTPTR:
    LDA #<STATUS    ; low byte of the STATUS address
    STA 0,X
    LDA #>STATUS    ; high byte of the STATUS address
    STA 1,X
    RTS

To call the setup routine in user code:

    LDX #MY_STATUS  ; zero page location that will hold the pointer
    JSR GETSTPTR

and then programs can simply query STATUS with...

Loop:
    JSR GETIN
    ; do stuff
    LDA (MY_STATUS) ; 65C02 zero page indirect read of STATUS
    BEQ Loop



 

 

SlithyMatt
Posts: 913
Joined: Tue Apr 28, 2020 2:45 am

Post by SlithyMatt »



6 hours ago, StephenHorn said:




I have no idea whether it's possible to exit gracefully after that by resetting the ROM bank to 4 before your program's final RTS instruction.



It is. That's what I do in most of my programs, and it works just fine as long as I have the bank set back to 4 before the RTS that should take us back to BASIC. However, certain RAM and VRAM states may make that difficult, so other state may need to be reset as well.
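
A minimal sketch of that pattern, assuming the program runs from RAM and was started from BASIC with SYS. The ROM_SEL address is an assumption: current releases put the ROM bank register at $01, while older ROM and emulator releases used $9F60, so check the one you target. The main routine is a placeholder, not anyone's actual program.

```asm
ROM_SEL   = $01       ; assumption: ROM bank register (older releases: $9F60)
BASIC_BNK = 4         ; BASIC ROM bank, per the posts above

start:
    stz ROM_SEL       ; select ROM bank 0 (KERNAL) early on
    jsr main          ; hypothetical: the program's real work
    lda #BASIC_BNK
    sta ROM_SEL       ; put BASIC's bank back...
    rts               ; ...before returning to the SYS that started us

main:
    rts               ; placeholder
```

This only restores the ROM bank. As noted above, RAM bank and VRAM state may need their own cleanup before the final RTS.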

Stefan
Posts: 456
Joined: Thu Aug 20, 2020 8:59 am

Post by Stefan »


Interesting stuff. 

I have three test programs to verify this, all reading the 73 kB source code file provided by @desertfish in the native assembly thread.


  1. The original test program, interrupts disabled


  2. The same, plus changing the ROM bank to 0 at start


  3. Same as no. 2, but reading the status directly from memory (status address = $0286, hope I didn't get that mixed up).


I clocked these manually (average of three runs per test program):


  1. 7.8 s (7.76 s + 7.76 s + 7.75 s) => 9.3 kB/s


  2. 4.4 s (4.33 s + 4.30 s + 4.42 s) => 16.6 kB/s (78 % faster)


  3. 4.1 s (4.12 s + 4.07 s + 4.12 s) => 17.8 kB/s (91 % faster)


The bank switching really seems to be the culprit. The READST function also hurts performance, but not nearly as much as the bank switching.


test_banking_status.asm
test_banking.asm
test.asm

Post by Stefan »


A side note.

The ROM version of X16 Edit I'm working on cannot avoid bank switching when reading files. The code is in ROM bank 7, and for each byte it reads it must go to the Kernal in ROM bank 0.

However, I'm not using the Kernal JSRFAR function, but my own minimal bank switching code stored in low RAM ($0400-$07FF). Opening the same 73 kB file takes about 5.3 s => 13.8 kB/s. The RAM version of the editor doesn't have to do the bank switching and is a bit faster, loading the file in about 4.5 s => 16.2 kB/s.

This is the bank switching code. Before calling this routine the address in jsr $ffff needs to be changed to the address you actually want to call. I use a macro to make this "self modification" safe.


bridge_kernal:
    stz ROM_SEL   ;Kernal is ROM bank 0
    jsr $ffff     ;$ffff is just placeholder
    pha
    lda #ROM_BNK  ;Set ROM select to our bank again
    sta ROM_SEL
    pla
    rts           ;14 bytes
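
One way the "change the address before calling" step could be wrapped in a macro is sketched below. This is not the macro X16 Edit actually uses. It assumes ca65 macro syntax, adds a hypothetical bridge_jsr label so the JSR operand can be patched by name, and reuses ROM_SEL and ROM_BNK from the listing above. It makes no attempt to guard against an interrupt handler using the bridge at the same time.

```asm
; Same bridge as above, with a label on the JSR so its operand can be patched.
bridge_kernal:
    stz ROM_SEL       ; Kernal is ROM bank 0
bridge_jsr:
    jsr $ffff         ; operand is overwritten before each call
    pha
    lda #ROM_BNK      ; back to our own ROM bank
    sta ROM_SEL
    pla
    rts

; Hypothetical helper macro: patch the target address, then call the bridge.
.macro bridge_call addr
    pha               ; keep A, it may carry an argument for the Kernal call
    lda #<(addr)
    sta bridge_jsr+1  ; low byte of the JSR operand
    lda #>(addr)
    sta bridge_jsr+2  ; high byte of the JSR operand
    pla
    jsr bridge_kernal
.endmacro

    bridge_call $FFCF ; example: call CHRIN through the bridge
```

The bridge has to live in RAM, both because it modifies itself and because it must stay mapped while the ROM bank changes.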


 

TomXP411
Posts: 1785
Joined: Tue May 19, 2020 8:49 pm

File I/O performance

Post by TomXP411 »



On 1/15/2021 at 10:57 PM, Stefan said:




A side note.



The ROM version of X16 Edit I'm working on cannot avoid bank switching when reading files. The code is in ROM bank 7, and for each byte it reads it must go to the Kernal in ROM bank 0.




You might save some time by reading larger blocks -  maybe copying your I/O code to low RAM when the editor starts. 
