REU: the poor man's blitter.

zBeeble · Post by **zBeeble** » Sun May 16, 2021 4:27 am

So ... I've been watching with interest, but not contributing because I didn't see anything I needed to say. I do now.

I think the function of the 64/128 REU's memory transfer function is under-appreciated. It's like a budget blitter chip. Your memory design has 2M of memory in an 8k bottleneck. Something along the lines of the REU that could:

- memory <--> bank

- bank <--> vera

- memory <--> vera

might make things massively more functional and enable a slate of games that wouldn't be possible otherwise. One flaw in the single 8k bank is that you can't effectively have segmented code and segmented data in a program ---- you have to choose one.

I looked at the REU implementation for the MiSTer FPGA project --- it's not a lot of FPGA code (or space) to do that.

Anyways... just a comment.

Scott Robison · Post by **Scott Robison** » Sun May 16, 2021 5:13 am

The problem is that the FPGA doesn't have direct access to the 64K primary address space or the 2M banked memory. Everything that goes on in the FPGA happens in a 32 byte window in the IO area. Thus it wouldn't be able to do stash / fetch / verify.

zBeeble · Post by **zBeeble** » Sun May 16, 2021 5:28 am

I didn't explicitly say you had to use that FPGA and I realize that the current design precludes it. But vera -> bank would allow direct loading of the bank from media... so it makes sense.

The REU + GEOS (about the only thing other than TMP that used it) ... the thing that made it rock was that the REU could blit an 8k page in 8192 cycles (or 1/1000th of a sec at your 8MHz speed .... a lowly 1/100th-ish at the C64's speed).
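As a rough check of those figures, here is a minimal sketch of the arithmetic. It assumes one byte moved per clock cycle, an 8 MHz X16, and an NTSC C64 at about 1.02 MHz (the C64 clock value is my assumption for "the C64's speed"; a PAL machine is slightly slower).

```python
# Rough transfer-time check: one byte per clock cycle (the claim in the post).
# The clock rates below are nominal values assumed for illustration.
PAGE_BYTES = 8192


def transfer_ms(num_bytes: int, clock_hz: float, cycles_per_byte: float = 1.0) -> float:
    """Milliseconds needed to move num_bytes at the given clock rate."""
    return num_bytes * cycles_per_byte / clock_hz * 1000.0


x16_ms = transfer_ms(PAGE_BYTES, 8_000_000)   # X16 at 8 MHz
c64_ms = transfer_ms(PAGE_BYTES, 1_022_727)   # NTSC C64, about 1.02 MHz

print(f"X16: {x16_ms:.3f} ms, C64: {c64_ms:.2f} ms")
```

So "1/1000th" is about right for the X16, and the C64 figure lands nearer 1/125th of a second than 1/100th.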

I'm just saying ... that I get leaving things out. I get minimalism. But the REU blit function is simple, concise and easy to understand ... and extraordinarily powerful. And it's period authentic, to boot.

And... yes... you could make a card (likely, I haven't looked at your slot diagram) that can at least, I assume, do the base 64k access --- but (and I'm assuming here) it would be missing the access to the 2M of memory _and_ the 128k of video memory. So having it in the base system makes the most sense.

Anyways... I have no say here --- just saying it's a missed opportunity.

Scott Robison · Post by **Scott Robison** » Sun May 16, 2021 8:55 am

One, just to be clear ... I am not one of the designers. I have no say, just expressing thoughts too.

Two, I'd love to see something like that myself, but it couldn't be done with the existing FPGA because it doesn't have the space or connections at present that would be required.

The one thing the Commander X16 has going in its favor over the C64 is a higher clock rate, but of course that will still take multiple clock cycles per byte to copy from a RAM bank.

There are lots of things that would be cool to see, it's just a matter of money at this point.

paulscottrobson · Post by **paulscottrobson** » Sun May 16, 2021 10:15 am

1 hour ago, Scott Robison said:

The one thing the Commander X16 has going in its favor over the C64 is a higher clock rate, but of course that will still take multiple clock cycles per byte to copy from a RAM bank.

A fair bit of this is lost because of the preponderance of 16 bit data in the video system, which the 6502 does not handle very efficiently.

BruceMcF · Post by **BruceMcF** » Sun May 16, 2021 2:03 pm

8 hours ago, zBeeble said:

And... yes... you could make a card (likely, I haven't looked at your slot diagram) that can at least, I assume, do the base 64k access --- but (and I'm assuming here) it would be missing the access to the 2M of memory _and_ the 128k of video memory. So having it in the base system makes the most sense.

Except access to the base 64K IS access to the 2M of High RAM and the 128k of video memory, since they are all accessed within the 64K address space.

zBeeble · Post by **zBeeble** » Sun May 16, 2021 3:35 pm

58 minutes ago, BruceMcF said:

Except access to the base 64K IS access to the 2M of High RAM and the 128k of video memory, since they are all accessed within the 64K address space.

Well... no... that is: no in many ways. I suppose you could make your card flipity-flip through it... but this would deny the action of one monolithic act. Not only that --- it would probably break if the CPU was using the existing mapping in any way. So many ways... no.

Even the choice to have one mapping region (rather than, say, two 4k mapping regions) has an outsized effect.

Let's say you decide to divide your big app into code segments. In the 40k, then, you need a smart dispatch and _all_ the data. If you use the pages for code, you can't effectively use them for (other than local) data.

Ok... restart. You decide to have data segments. Cool. Then _all_ your code (and some data) needs to be in the 40k. This works for small programs, but falls down as code size approaches 40k.

The REU has a dead-simple interface and gives the user a wide menu of uses. I forget if it's start+end or start+count ... + direction ... but same difference. And it copies one byte per cycle.

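Your best loop is going to be

loop:
     LDA SRC        ; absolute load; its operand bytes live at loop+1 / loop+2
     STA DST        ; absolute store; its operand bytes live at loop+4 / loop+5
     INC loop+1     ; self-modify: bump source address, low byte
     BNE g1
     INC loop+2     ; carry into source address, high byte
g1:
     INC loop+4     ; self-modify: bump destination address, low byte
     BNE g2
     INC loop+5     ; carry into destination address, high byte
g2:
     DEX            ; X = low byte of count
     BNE loop
     DEY            ; Y = high byte of count (load high byte + 1 if X starts nonzero)
     BNE loop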

What did we do here? The 16-bit number with X as the low byte and Y as the high byte is the count. We're using self-modifying code (silly on modern computers, but fast on these old chips) for the SRC and DST addresses. For talking to the Vera, you save 2 of the 4 INCs.

So... let's count the cycles... we'll only count the cycles where none of the 16 bit rollovers are happening.

LDA - 4, STA - 4, INC - 6 * 2, BNE - 3 * 3, DEX - 2. So that's 31 cycles per byte for regular memory, and 22 for Vera (one INC and one BNE fewer).

... so something that is around a few dozen lines of verilog gives you a move operation that is between 22 and 31 _times_ faster than your other options.
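The tally above can be sketched as a small model. This is only an illustration, assuming the usual 65C02 timings (absolute LDA/STA 4 cycles, absolute INC 6, taken branch 3 with no page-crossing penalty, DEX 2) and counting only the common path where no 16-bit carry happens. The function name and the table are mine, not part of any toolchain.

```python
# Cycle counts for the common path of the loop in the post (no 16-bit carries).
# Timings assumed: absolute LDA/STA = 4, absolute INC = 6,
# taken branch = 3 (page-crossing penalties ignored), DEX = 2.
CYCLES = {"LDA": 4, "STA": 4, "INC": 6, "BNE": 3, "DEX": 2}


def cycles_per_byte(pointer_increments: int) -> int:
    """Each software-maintained pointer costs one INC plus one taken BNE."""
    per_pointer = CYCLES["INC"] + CYCLES["BNE"]
    loop_control = CYCLES["DEX"] + CYCLES["BNE"]
    return CYCLES["LDA"] + CYCLES["STA"] + pointer_increments * per_pointer + loop_control


ram_to_ram = cycles_per_byte(2)   # source and destination both bumped in software
ram_to_vera = cycles_per_byte(1)  # the Vera data port advances its own address

print(ram_to_ram, ram_to_vera)
```

Against a device that moves one byte per cycle, those per-byte costs are also the speedup factors.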

Some of you will chime in about ,X addressing. I always wrote my loops this way... Just did it again by habit, I suppose. The argument is still sound --- one byte per cycle is unimaginably fast w.r.t. the 65c02.
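For anyone curious how the ,X variant compares, here is a hedged estimate. The loop it models (LDA src,X / STA dst,X / INX / BNE) is my assumption of what such a version would look like, not code from this thread, and it covers only the inner loop within one 256-byte page, ignoring the page-crossing penalty on the indexed load.

```python
# Inner loop under ",X" indexing (a sketch; this loop is not from the post):
#   LDA src,X / STA dst,X / INX / BNE loop
# Timings assumed: LDA abs,X = 4 (no page cross), STA abs,X = 5,
# STA abs = 4, INX = 2, taken BNE = 3.
INDEXED = {"LDA abs,X": 4, "STA abs,X": 5, "STA abs": 4, "INX": 2, "BNE": 3}


def indexed_cycles(dest_is_port: bool) -> int:
    """Per-byte cost; a fixed port address needs no indexing on the store."""
    store = INDEXED["STA abs"] if dest_is_port else INDEXED["STA abs,X"]
    return INDEXED["LDA abs,X"] + store + INDEXED["INX"] + INDEXED["BNE"]


def dma_speedup(cpu_cycles_per_byte: int, dma_cycles_per_byte: int = 1) -> float:
    """How many times faster a one-byte-per-cycle mover would be."""
    return cpu_cycles_per_byte / dma_cycles_per_byte


ram_copy = indexed_cycles(dest_is_port=False)
vera_copy = indexed_cycles(dest_is_port=True)
print(ram_copy, vera_copy, dma_speedup(ram_copy))
```

Even with the tighter loop, the one-byte-per-cycle mover stays more than ten times faster under these assumptions.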

BruceMcF · Post by **BruceMcF** » Sun May 16, 2021 11:23 pm

So you have a bus mastering card in a slot. Like the original REU was a cartridge.

Even if you have a dead simple REU, access to the 64K is all you need. What you need is two MODES: one of which works with a range of RAM, the other works on a single RAM address.

You toggle Bank 18 of High RAM in, suck 8K into the REU, toggle Bank 30 in, blast 8K from the REU. The overhead of the CPU setting the Bank is trivial for moving 8K that fast. If THAT is what you mean by "I suppose you could make your card flipity-flip through it... but this would deny the action of one monolithic act" ... treating the overhead of a CPU action to change banks ONCE PER 8,192 bytes as "denying the action of one monolithic act" is silly ... the program will be USING the segments in units of banks, the SD card routine will be LOADING them into one or more consecutive banks, loading data into a launch pad section of banks to get it into the REU is the sensible way to operate anyway.

And there's not even a "once per 8K" flippity-flip when the REU talks to the Vera, because of the way the port access works. You set up the autoincrement in the Vera for the data port you are going to use and the type of data load you are going to do, and point the REU in "Port Channel Mode" to either DATA0 or DATA1. Now you can blast in a whole bitmap, or text screen, or bitmap row, or bitmap column, or palette, or PCM buffer, etc.
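The two modes described above (a range of RAM versus a single port address) can be sketched with a toy bus model. Everything here is hypothetical: `Machine`, `DmaCard` and their method names are invented for illustration, and the model only covers the $A000 High RAM window and one Vera data port (taken here to be DATA0 at $9F23, as I understand the X16 memory map).

```python
BANK_SIZE = 0x2000          # 8K High RAM window
WINDOW = 0xA000             # base of the High RAM window (assumed X16 map)
VERA_DATA0 = 0x9F23         # Vera data port 0 (assumed X16 map)


class Machine:
    """Toy model of the bus: one banked RAM window plus one Vera data port."""

    def __init__(self, banks: int = 32):
        self.banks = [bytearray(BANK_SIZE) for _ in range(banks)]
        self.bank = 0                     # selected by the CPU, not by the card
        self.vram = bytearray(128 * 1024)
        self.vera_addr = 0                # set up by the CPU beforehand
        self.vera_step = 1                # auto-increment per port access

    def read(self, addr: int) -> int:
        return self.banks[self.bank][addr - WINDOW]

    def write(self, addr: int, value: int) -> None:
        if addr == VERA_DATA0:
            self.vram[self.vera_addr] = value
            self.vera_addr += self.vera_step
        else:
            self.banks[self.bank][addr - WINDOW] = value


class DmaCard:
    """Hypothetical REU-like card with its own buffer RAM."""

    def __init__(self):
        self.buffer = bytearray()

    def load_from_bus(self, m: Machine, start: int, count: int) -> None:
        self.buffer = bytearray(m.read(start + i) for i in range(count))

    def store_range(self, m: Machine, start: int) -> None:   # range mode
        for i, b in enumerate(self.buffer):
            m.write(start + i, b)

    def store_port(self, m: Machine, port: int) -> None:     # single-address mode
        for b in self.buffer:
            m.write(port, b)


m, card = Machine(), DmaCard()
m.banks[18][:4] = b"\x01\x02\x03\x04"
m.bank = 18
card.load_from_bus(m, WINDOW, BANK_SIZE)   # CPU picks the bank once
m.bank = 30
card.store_range(m, WINDOW)                # ... and once more for the target
m.vera_addr = 0x1000
card.store_port(m, VERA_DATA0)             # port mode: Vera advances the address
```

The card never touches the bank register in this sketch; the CPU flips it once per 8K, which is the whole point of the argument.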

Indeed, one thing the card would want to have would be the ability to work in SMALLER units in order to avoid stalling interrupts for such a long period. 8K is probably too MUCH to do at once ... it's probably fine to have a byte register for size of move and move 1-256 bytes at a time. Then if 256 cycles is not going to mess up your music player or whatever (because you are setting up), you move things a page at a time, in 256 clocks plus loop overhead, and if you need more granularity, it's available ... if you know that 32 clock cycles is the longest you can hold off interrupts without messing with the music player in the middle of a game, you move things 32 bytes at a time, in 32 clocks plus loop overhead.

Just don't clear the target and internal address registers at the end of a transfer, and restore the count register to what it was at the beginning, and the transfers chain perfectly well.
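A minimal sketch of that chaining behaviour, under stated assumptions: `ChunkedDma` is a made-up register model, the card moves one byte per cycle, and the 9 cycles of CPU loop overhead per chunk (roughly a trigger store, a DEX and a BNE) is a placeholder figure, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class ChunkedDma:
    """Hypothetical card registers: addresses persist, count keeps its value."""
    src: int
    dst: int
    count: int            # bytes moved per trigger, 1..256

    def trigger(self, memory: bytearray) -> int:
        """Move one chunk; return cycles the CPU is held off (1 per byte)."""
        memory[self.dst:self.dst + self.count] = memory[self.src:self.src + self.count]
        self.src += self.count   # not cleared, so the next trigger continues
        self.dst += self.count
        return self.count        # count register still holds the chunk size


def copy_chunked(memory: bytearray, src: int, dst: int, total: int,
                 chunk: int, loop_overhead: int = 9) -> int:
    """Total cycles for a copy done in chunks (total must be a multiple of chunk)."""
    dma = ChunkedDma(src, dst, chunk)
    cycles = 0
    for _ in range(total // chunk):
        cycles += dma.trigger(memory) + loop_overhead
    return cycles


mem = bytearray(16384)
mem[:8192] = bytes(i % 251 for i in range(8192))
cycles_32 = copy_chunked(mem, 0, 8192, 8192, chunk=32)    # longest stall: 32 cycles
cycles_256 = copy_chunked(mem, 0, 8192, 8192, chunk=256)  # longest stall: 256 cycles
print(cycles_32, cycles_256)
```

Under these assumptions the 32-byte granularity costs about a quarter more cycles overall than page-sized chunks, in exchange for never holding off interrupts for more than 32 cycles.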

As you say, it is not very much system resources in a programmable logic chip. It is conceivable that an REU with a 512K RAM chip and a high speed two channel USART interface could fit into a CPLD.

zBeeble · Post by **zBeeble** » Mon May 17, 2021 1:43 pm

But the solution you propose still doesn't deal with the CPU using the 8K bank at the time. I'm just saying there's a real reason for a design change.

I don't think we've discussed whether the Vera can take a store (or fetch) per cycle, either... for that matter.

ZeroByte · Post by **ZeroByte** » Mon May 17, 2021 2:47 pm

Well, during blit operations, the CPU would be suspended, so it doesn't really matter what it was doing at the time. It's like hypersleep for long space voyages. You get in the freezer, and when you get out what feels like mere moments later, it's been 18 years and you're halfway across the galaxy. Sure, your kids grew up while you were gone, but you knew you were signing up for that when you applied for the mission. Same for mass blits. So long as the REU leaves the RAM/ROM/VERA pointers in whatever states they were in when it started, then no harm will come to the software using a DMA controller card installed in the expansion slots. If the controller doesn't do that, then you could SEI, do DMA, fix any bank / VERA pointers needed, then CLI.

And as @BruceMcF points out - having to switch RAM banks during a block copy operation is of negligible impact. You'd just need to build the DMA controller to know about the banking structure and issue the appropriate bank swap writes and update its src/dst pointers accordingly.
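To make the "know about the banking structure" part concrete, here is a hedged sketch of how such a controller might split one linear High RAM transfer into per-bank pieces. The function name is hypothetical, and the $A000 window base and 8K bank size are taken from the discussion above; how the controller would actually write the bank select depends on the board and is not modelled.

```python
BANK_SIZE = 0x2000   # 8K per bank
WINDOW = 0xA000      # base of the banked window (assumed)


def split_by_bank(start_bank: int, start_addr: int, length: int):
    """Yield (bank, window_address, byte_count) pieces of a linear transfer."""
    bank, offset = start_bank, start_addr - WINDOW
    while length > 0:
        piece = min(length, BANK_SIZE - offset)   # stop at the end of the window
        yield bank, WINDOW + offset, piece
        length -= piece
        bank, offset = bank + 1, 0                # next bank starts at the window base


pieces = list(split_by_bank(18, 0xB000, 0x3000))
for bank, addr, count in pieces:
    print(f"bank {bank}: ${addr:04X}, {count} bytes")
```

Each piece corresponds to one bank-select write followed by an uninterrupted burst, so a 12K transfer starting mid-bank needs only two bank writes here.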