
REU: the poor man's blitter.

Posted: Tue May 18, 2021 11:29 am
by BruceMcF


20 hours ago, ZeroByte said:




And as @BruceMcF points out - having to switch RAM banks during a block copy operation is of negligible impact. You'd just need to build the DMA controller to know about the banking structure and issue the appropriate bank swap writes and update its src/dst pointers accordingly.



Since I am "imagining doing" the REU, I certainly was NOT having the DMA controller know ANYTHING about banking structure ... I was having the CPU handle that.

So, one register has the chunk size control, maybe the control register, maybe another. The control register has whatever it has, but the readable bit 0 is "start": it is set to 1 to begin, remains 1 for the operation (which doesn't matter because the CPU is asleep), and goes to 0 when completed. Point the REU address to the source and the target address to $A000. The bank is in A and the number of chunks is in X. Y is transient. "0" in X implies a count of 256.

PASTEBANK:  LDY $0 : PHY : STA $0  : LDA REUCONTROL : ORA #1 : - STA REUCONTROL : DEX : BNE - : PLA : STA $0 : RTS

My interrupt code must not change the bank without saving and restoring it, but the longest an interrupt has to wait is 35 cycles, because in effect STA REUCONTROL is a 35 clock cycle instruction that copies 32 bytes.

The overhead is 8 clocks on top of 32, so 25% overhead. If you want lower overhead, make the chunk bigger. At 128 byte chunks, it's 6.3% overhead. At 256 byte chunks, it's 3.1% overhead. So I don't see any particular reason why it would ever need to be bigger than 128 byte chunks ... making that a loop adds 3 bytes and multiplies the block moved by up to 256, so a maximum-size 128 byte chunk covers 32K, where the maximum Bank move is 8K before it's time to increment the bank register. In the above, if interrupts can touch transient zero page API space, and this is rommable code so I can't just store after this routine, I can have 128 byte chunks if I am not worried about interrupt lagginess. X will have 64 in it, and Y can say how many blocks:

PASTEBANKS: SEI : STA $20 : LDA $0 : PHA : LDA $20 : CLI : STA $0 : LDA REUCONTROL : ORA #1 : PHX : -- PLX : PHX : - STA REUCONTROL : DEX : BNE - : INC $0 : DEY : BNE -- : PLX : PLA : STA $0 : RTS

(Unless I have the meaning of SEI and CLI reversed ... it's been over 40 years) ... A power of two chunk size is 0 for 1 byte through to 7 for 128 bytes, so three bits in the REUCONTROL register for chunk size. 1 bit for increment target versus stable target, 1 bit for direction (copy into REU, paste into CX16) ... we still have two bits free in the REU control register. Two bytes for the CX16 address, three bytes for the REU address if we have a 512K SRAM in the REU, and plenty of room for expansion if people decide they want bigger ones.
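As a rough check of the numbers and the register layout sketched in this post, here is a small Python model. It is not the poster's code. The 8-clock loop overhead, the 8K bank and bit 0 as "start" come from the post; the positions of the other bits and all function names are assumptions made up for illustration.

```python
LOOP_OVERHEAD_CLOCKS = 8   # per-chunk overhead figure quoted in the post
BANK_BYTES = 8 * 1024      # one High RAM bank, the 8K bank move from the post

def overhead_percent(chunk_bytes):
    """Loop overhead relative to copy time, at one byte per clock."""
    return 100.0 * LOOP_OVERHEAD_CLOCKS / chunk_bytes

def chunks_per_bank(chunk_bytes):
    """Value to load into X so that one inner loop fills a whole bank."""
    return BANK_BYTES // chunk_bytes

START_BIT = 0x01       # bit 0: "start", as described in the post
SIZE_SHIFT = 1         # ASSUMED: bits 1-3 hold log2(chunk size), 0..7
INCREMENT_BIT = 0x10   # ASSUMED: bit 4, 1 = increment target, 0 = stable target
PASTE_BIT = 0x20       # ASSUMED: bit 5, 1 = paste into CX16, 0 = copy into REU

def pack_control(chunk_bytes, increment_target, paste, start=False):
    """Build a hypothetical REUCONTROL value from its fields."""
    exponent = chunk_bytes.bit_length() - 1
    if chunk_bytes < 1 or chunk_bytes > 128 or (1 << exponent) != chunk_bytes:
        raise ValueError("chunk size must be a power of two from 1 to 128")
    value = exponent << SIZE_SHIFT
    if increment_target:
        value |= INCREMENT_BIT
    if paste:
        value |= PASTE_BIT
    if start:
        value |= START_BIT
    return value

for size in (32, 128, 256):
    print(f"{size:3d}-byte chunks: {overhead_percent(size):.3f}% overhead")
print("chunks per bank at 128 bytes:", chunks_per_bank(128))
print("control byte, 128-byte paste:", hex(pack_control(128, True, True)))
```

Under these assumptions the top two bits of the control byte stay unused, which matches the "two bits free" count in the post.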

 

 



Posted: Tue May 18, 2021 1:31 pm
by ZeroByte

I think that’s fancier than it needs to be. If I were making such a device I’d just have a few registers: src/dst_addr, src/dst_stride, src/dst_bank, and a 16-bit num_bytes register. Finally, a go register that has bit flags for latching behaviors of the DMA parameter registers upon completion, where 1=reset to beginning value, 0=leave them at the final value.

If you use zeros for the start_DMA value, when DMA is complete, the addresses in the src/dst regs will be one stride past the last byte, and num_bytes will be zero, and if the banks switched during transfer, the src/dst will be the ones where the next src/dst bytes live.

So if you wanted to do writes a page at a time, you could just write 255 into num_bytes and set a 1 flag for the num_bytes field in your “begin DMA” write. It would pick up right where it left off.

DMA into VERA this way is just setting dst_addr to $9f23 and dst_stride to 0. No special mode is needed.
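A minimal Python sketch of the stride idea, assuming a byte-at-a-time engine that adds a separate stride to each pointer after every transfer. The `Bus` class and `dma_copy` function are hypothetical stand-ins for illustration, not a description of any real device; $9F23 is the VERA data port named above.

```python
VERA_DATA0 = 0x9F23  # VERA data port address named in the post

class Bus:
    """Toy 64K address space that records what is written to the VERA port."""
    def __init__(self):
        self.ram = bytearray(0x10000)
        self.vera_writes = []  # bytes that reached the data port, in order

    def read(self, addr):
        return self.ram[addr]

    def write(self, addr, value):
        if addr == VERA_DATA0:
            self.vera_writes.append(value)
        else:
            self.ram[addr] = value

def dma_copy(bus, src_addr, dst_addr, num_bytes, src_stride=1, dst_stride=1):
    """Copy num_bytes, stepping each pointer by its own stride.

    Returns the final (src_addr, dst_addr), each one stride past the last byte.
    """
    for _ in range(num_bytes):
        bus.write(dst_addr, bus.read(src_addr))
        src_addr += src_stride
        dst_addr += dst_stride
    return src_addr, dst_addr

bus = Bus()
bus.ram[0x1000:0x1004] = b"\x01\x02\x03\x04"
# dst_stride=0 keeps the destination pinned on the data port.
end_src, end_dst = dma_copy(bus, 0x1000, VERA_DATA0, 4, dst_stride=0)
print(bus.vera_writes, hex(end_src), hex(end_dst))
```

With a destination stride of zero every byte lands on the same port address, so streaming into VERA falls out of the general case rather than needing its own mode.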

As for banking, I think it might make sense for the DMA to halt if a src or dst pointer underflows from zero, or strides forward from base RAM to $A000+, leaving the bytes remaining in the num_bytes register.

If either src or dst starts in a bank window, then it uses the values in src/dst bank, and if it strides out of the window, it wraps the pointer around and incs/decs the bank #.
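The wrap rule for a pointer that starts inside the bank window could be sketched as below. This is only an illustration: it assumes the 8K window at $A000-$BFFF, a stride no larger than the window, and an 8-bit bank number that wraps; `step_banked` is a made-up name.

```python
BANK_LO = 0xA000          # first address of the banked RAM window
BANK_HI = 0xC000          # one past the last address ($BFFF)
WINDOW = BANK_HI - BANK_LO  # 8K

def step_banked(addr, bank, stride):
    """Advance a pointer that lives in the bank window by one stride.

    If the pointer leaves the window it wraps around and the bank number
    is incremented (forward) or decremented (backward).
    """
    addr += stride
    if addr >= BANK_HI:
        addr -= WINDOW
        bank = (bank + 1) & 0xFF   # ASSUMED: bank register is 8 bits wide
    elif addr < BANK_LO:
        addr += WINDOW
        bank = (bank - 1) & 0xFF
    return addr, bank

print(step_banked(0xBFFF, 3, 1))    # crosses the top of the window
print(step_banked(0xA000, 3, -1))   # crosses the bottom of the window
```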

This device wouldn’t need any RAM of its own to do these DMA transfers. To me, if it were to have its own RAM, it should have gobs of it, like 32Mb, so it adds value, like a RAMdisk or something, and that RAM should be referenced as a flat blob so there is no dealing with banks internal to the REU.

Let the programmer decide whether it’s a good or bad idea to DMA 128k in one shot or in smaller chunks to allow other operations. If I’m on a loading screen anyway, I don’t need to keep stopping and saying “are we there yet?”

Maybe I want to hop into the hyper sleep pod and don’t thaw me out until we’re in the Andromeda galaxy. Yes, I know my grandkids will be old by the time we’re there, but that’s the mission I signed up for.

 



Posted: Tue May 18, 2021 10:46 pm
by Kalvan

VERA would still need a second DMA (otherwise you get bus contention issues with its internal VRAM) and much more than 32 bits of interface with System RAM to make this work.

 

Hindsight on matters like these is always 20/20.



Posted: Tue May 18, 2021 11:21 pm
by BruceMcF


9 hours ago, ZeroByte said:




I think that’s fancier than it needs to be. If I were making such a device I’d just have a few registers: src/dst_addr, src/dst_stride, src/dst_bank, and a 16-bit num_bytes register. Finally, a go register that has bit flags for latching behaviors of the DMA parameter registers upon completion, where 1=reset to beginning value, 0=leave them at the final value.



If you use zeros for the start_DMA value, when DMA is complete, the addresses in the src/dst regs will be one stride past the last byte, and num_bytes will be zero, and if the banks switched during transfer, the src/dst will be the ones where the next src/dst bytes live.



So if you wanted to do writes a page at a time, you could just write 255 into num_bytes and set a 1 flag for the num_bytes field in your “begin DMA” write. It would pick up right where it left off.



DMA into VERA this way is just setting dst_addr to $9f23 and dst_stride to 0. No special mode is needed.



As for banking, I think it might make sense for the DMA to halt if a src or dst pointer underflows from zero, or strides forward from base RAM to $A000+, leaving the bytes remaining in the num_bytes register.



If either src or dst starts in a bank window, then it uses the values in src/dst bank, and if it strides out of the window, it wraps the pointer around and incs/decs the bank #.



This device wouldn’t need any RAM of its own to do these DMA transfers. To me, if it were to have its own RAM, it should have gobs of it, like 32Mb, so it adds value, like a RAMdisk or something, and that RAM should be referenced as a flat blob so there is no dealing with banks internal to the REU.



Let the programmer decide whether it’s a good or bad idea to DMA 128k in one shot or in smaller chunks to allow other operations. If I’m on a loading screen anyway, I don’t need to keep stopping and saying “are we there yet?”



Maybe I want to hop into the hyper sleep pod and don’t thaw me out until we’re in the Andromeda galaxy. Yes, I know my grandkids will be old by the time we’re there, but that’s the mission I signed up for.



 



Yeah, I was avoiding a stride, to make it simpler. With a stride in the Vera, it's not necessary to have a stride in the data created to go into the Vera.

And letting the count be smaller, to make it simpler, even if the VHDL covers over the extra complexity.

So it seems to me you are saying what I was sketching is more complicated and then replacing it with a more complicated one, to avoid having the built in REU RAM, when I never even mentioned the more complicated part! (yet)

But a design without its own RAM is half the speed when filling the PCM with pre-designed chunks of data, or when filling Vera bitmaps with pre-designed chunks of data. I would not want to give up the chance at that 1 byte per clock, rather than 1 byte per two clocks, if I have the ability for it to be 1 byte per clock right there.

And having those residing IN the REU frees up storing them in the CX16 RAM, so it frees up anywhere up to 512K RAM in the CX16, which is anywhere from 25% to 100% of High RAM.