Proposal for a hardware-agnostic math accel API
Posted: Thu Feb 15, 2024 1:59 am
I've been chatting about this a bit on the Discord in #homebrew-hardware, and after batting some ideas around I thought I'd collect my initial thoughts on what a hardware-agnostic math accelerator API might look like, one that could be implemented any number of ways. It's kind of interesting, at least as a thought experiment.
Couple points first though:
- My work on DreamTracker is far more important as I march towards a beta release, so this is mostly a fun thought exercise at the moment
- Part of the exercise is to learn about FPGAs and RISCV. Being able to tinker with both on the X16 just seemed interesting, even if not practical, since I can break things into small logical chunks (I'll explain in a moment). I can also play with these for some non-X16 projects, so I have options (once I have time, anyway).
- One of the things I love about the X16 is that it's a simple base to build on, and that base will likely be what most folks end up using. As a result, I'm not trying to reinvent the base architecture here, and in fact I think it's incredibly important at this stage to keep the X16 as-is (beyond perhaps the 65816 upgrade, given it may not require any hardware changes, though even that I see as a bonus and not a requirement). This post isn't really about changing the X16 itself.
- That said, since the X16 has an expansion bus, it makes for a fun platform to tinker on, and perhaps this weirdo idea might turn into a nice solution for cartridges or become a popular enough expansion card that it sees adoption
Anyways, while looking into things like FemtoRV and the UPduino (and other Lattice-based FPGA boards) for the above, I soon realized some of these have enough pins to connect to the entire X16 bus (including some important additional pins, like IRQB, RDY, etc.). So my first/grand idea was to sort out what a RISCV co-processor on a card might look like. It would run inside an FPGA as part of a soft SoC, have its own RAM, and map the X16's RAM into its larger 32-bit address space. The idea is that the X16 CPU would still be used for KERNAL calls where possible (some things, like a keyboard routine, might have to be implemented in RISCV if needed; otherwise control would be passing back and forth), but issuing a call to the MMIO address of the card could instruct some front-end logic to pause the 6502 (via RDY), perhaps set up the return value, and then pass control to the RISCV core.
It seems like a crazy idea, but a neat one, though it's beyond my current capability and fairly esoteric in many ways. Since accelerator cards have come up in conversations on the Discord, I started thinking about ways to simplify this while still providing some benefit. It just so happens that the Lattice FPGA used by VERA is the same one used by the UPduino I'll be using initially. It has some on-board SRAM, among other nice things. Looking at how VERA is structured kinda points to an interesting solution.
Instead of hooking a RISCV core up to the whole bus, one can hook up just the minimum part of the bus needed for the IO space (the IO pin plus the lower 5 address bus lines) and pass data back and forth through MMIO. This gives a rudimentary set of registers with an instruction and status - so more or less a special-purpose accelerator processing unit.
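To make the reduced hookup a bit more concrete, here is a rough Python sketch of the address decode such a card would do: it only watches its IO select line and the low five address lines, which gives it a 32-byte window. The slot base `$9F60` and the names `IO_BASE`, `WINDOW_SIZE`, and `decode` are assumptions for illustration only, not part of the proposal.

```python
# Hypothetical model of the card's address decode (names and slot base are assumptions).
IO_BASE = 0x9F60      # assumed expansion IO slot; a real card would use whichever slot it is configured for
WINDOW_SIZE = 1 << 5  # five address lines -> 32 addressable bytes


def decode(address):
    """Return the register index (0-31) if the address is in the card's window, else None."""
    # Comparing the upper address bits stands in for the IO select pin going active.
    selected = (address & ~(WINDOW_SIZE - 1)) == IO_BASE
    if not selected:
        return None
    # The lower 5 address lines pick the register inside the window.
    return address & (WINDOW_SIZE - 1)


if __name__ == "__main__":
    for addr in (0x9F60, 0x9F6F, 0x9F7F, 0x9F80):
        print(hex(addr), decode(addr))
```

From the card's point of view, everything outside that window simply doesn't exist, which is what keeps the pin count and logic small.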
All told I came up with this:
Code: Select all
$9Fx0: Command / "Opcode"
$9Fx1: Control
$9Fx2: Status
$9Fx3: Operation Width
$9Fx4-$9Fx7: I0-I3 (Operand(s) 1)
$9Fx8-$9FxB: J0-J3 (Operand(s) 2)
$9FxC-$9FxF: R0-R3 (Result(s))
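As a rough illustration of how software might address this layout, here is a small Python sketch that names the register offsets and lists the byte stores needed to load an operand. The `Reg` names and the `operand_writes` helper are hypothetical, and little-endian byte order is an assumption (the map itself doesn't fix one).

```python
from enum import IntEnum


class Reg(IntEnum):
    """Register offsets from the card's base address, following the map above."""
    COMMAND = 0x0
    CONTROL = 0x1
    STATUS = 0x2
    WIDTH = 0x3
    OPERAND_I = 0x4  # operand 1, 4 bytes
    OPERAND_J = 0x8  # operand 2, 4 bytes
    RESULT = 0xC     # result, 4 bytes


def operand_writes(reg, value, width):
    """List the (offset, byte) stores needed to load `value` into an operand register.

    Little-endian byte order is an assumption, not something the proposal specifies.
    """
    if not 1 <= width <= 4:
        raise ValueError("width must be 1-4 bytes")
    if not 0 <= value < (1 << (8 * width)):
        raise ValueError("value does not fit in the given width")
    return [(reg + i, (value >> (8 * i)) & 0xFF) for i in range(width)]


if __name__ == "__main__":
    # Show the stores for a 2-byte operand I.
    for offset, byte in operand_writes(Reg.OPERAND_I, 0x3412, 2):
        print(f"store ${byte:02X} at base+${offset:X}")
```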
As a crude example, one might do something like the following:
Code: Select all
; We want to multiply 2 numbers
lda #RV_MULTIPLY
sta RV_COMMAND
; Set precision/width (2 bytes)
lda #$02
sta RV_WIDTH
; Load I
lda #$12
sta RV_I
lda #$34
sta RV_I + 1
; Load J
lda #$56
sta RV_J
lda #$67
sta RV_J + 1
; With everything set up, tell the APU to go
lda #RV_EXECUTE
sta RV_CONTROL
; Wait for the results. This polls in a loop, but the CPU could go do something else until the interrupt fires
@wait_for_result:
lda RV_STATUS
beq @wait_for_result
; Do something with the results
lda RV_RES
...
lda RV_RES + 1
...
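For anyone who wants to poke at the idea without hardware, here is a minimal Python sketch of how an emulator-side device could respond to that same sequence. Everything here is an assumption for illustration: the opcode, control, and status values, the little-endian byte order, and the `MathAccel` and `multiply_16` names. Execution is instantaneous in this model, so the polling loop exits on its first pass.

```python
# Hypothetical software model of the accelerator; opcode, control, and status values are assumptions.
RV_MULTIPLY = 0x01  # assumed opcode value
RV_EXECUTE = 0x01   # assumed "go" bit in the control register
STATUS_DONE = 0x01  # assumed non-zero "result ready" status

COMMAND, CONTROL, STATUS, WIDTH = 0x0, 0x1, 0x2, 0x3
OP_I, OP_J, RES = 0x4, 0x8, 0xC


class MathAccel:
    """Sixteen byte-wide registers; writing RV_EXECUTE to CONTROL runs the command."""

    def __init__(self):
        self.regs = bytearray(16)

    def write(self, offset, value):
        self.regs[offset & 0x0F] = value & 0xFF
        if (offset & 0x0F) == CONTROL and value & RV_EXECUTE:
            self._execute()

    def read(self, offset):
        return self.regs[offset & 0x0F]

    def _execute(self):
        self.regs[STATUS] = 0  # busy
        width = min(self.regs[WIDTH], 4)  # operands are at most 4 bytes
        i = int.from_bytes(self.regs[OP_I:OP_I + width], "little")
        j = int.from_bytes(self.regs[OP_J:OP_J + width], "little")
        if self.regs[COMMAND] == RV_MULTIPLY:
            result = (i * j) & 0xFFFFFFFF  # result register is 4 bytes wide
            self.regs[RES:RES + 4] = result.to_bytes(4, "little")
        # Unknown commands leave the result untouched in this sketch.
        self.regs[STATUS] = STATUS_DONE


def multiply_16(dev, a, b):
    """Replay the store/poll/load sequence from the assembly example."""
    dev.write(COMMAND, RV_MULTIPLY)
    dev.write(WIDTH, 2)
    dev.write(OP_I, a & 0xFF)
    dev.write(OP_I + 1, a >> 8)
    dev.write(OP_J, b & 0xFF)
    dev.write(OP_J + 1, b >> 8)
    dev.write(CONTROL, RV_EXECUTE)
    while dev.read(STATUS) == 0:  # mirrors the beq polling loop
        pass
    return sum(dev.read(RES + n) << (8 * n) for n in range(4))


if __name__ == "__main__":
    print(hex(multiply_16(MathAccel(), 0x3412, 0x6756)))
```

A hardware implementation would presumably take some number of cycles between the control write and the status change, which is where the polling loop (or an interrupt) earns its keep.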
I've been saying FPGA here, but it was pointed out that there are potential non-FPGA solutions (like the RP2040) which could respond the same way. Enter the "agnostic" part.
This could be used for both simple and complex things, like multiply, divide, floating point, even matrices/vectors, etc. The implementation details would be up to the designer implementing the API. In my case, I could either implement a RISCV soft-core and write the math in software, or build some bespoke logic structures directly on the FPGA to make a simple custom processor. From the standpoint of the X16, it just sees some memory addresses.
A more extreme case, blending the original idea with this one, is having some operations optionally throw data directly to VERA (by pausing the 6502). That would require some additional address lines (though still not the entire bus, since VERA is up in the I/O space as well), but if you need to do complicated matrix calculations or some such, that might be an option. The nice thing about an API, compared to an entire RISCV coprocessor sitting on the bus, is that it can be implemented more easily in the emulators, without an entire RISCV soft-core and all the complex handling of passing control back and forth.
It could also be used in a cartridge as an accelerator. This is where the RP2040 was mentioned, given its cost and the fact that its performance would probably be close to that of the FPGA at X16 bus speeds.
And while I think the RISCV coprocessor idea is neat, and maybe I'll explore it one day, I think the above is a lot more attainable at my current level of skill while helping me learn FPGA concepts. It's something I think I'll pursue at some point, once I get used to the UPduino and the open source tooling and have time to put it on an expansion card (probably Kevin's protoboard first).
Anyway it might be a silly idea but I thought I'd collect my thoughts to share with folks. If nothing else, it shows one of the powerful features of the X16 (the exposed bus and expansion cards). And if you got all the way to the bottom of this diatribe, you're awesome!