User Port Dreams: SPI

BruceMcF · Post by **BruceMcF** » Mon Sep 19, 2022 12:37 am

[Edited to reflect the news from David on Facebook about VIA2 being an optional expansion card rather than built into the board.]

So, we are supposedly getting I2C. I am told (see below) that the CX16 at 8MHz, running its I2C without clock stretching, gets about 150kbps-175kbps, or 18-22KB/s (raw throughput, before subtracting bus management and device command overheads).

And while I2C has been around for a long time, there is another fairly widely used communications protocol to talk to things like sensors and LCD panels and Real Time Clocks, which is the Serial Peripheral Interface, SPI. SPI has been used for decades now to let micro-controllers talk to all sorts of peripherals at speeds appropriate to the application. Plus, for any task that those not in the project might wish to "cheat" on and hand off to a RPi Nano or RPi Zero, SPI is a very convenient way to talk to a Raspberry Pi board, typically substantially faster than RS232 serial.

So the I2C -- which I now know is running at around 150-175 kbps -- is what I was holding in mind as the built-in alternative to SPI over the User Port.... especially since I2C is more widely supported nowadays in most areas.

So I did some sketches. None of the code is tested, not least because none of the hardware it would run on exists yet, so the clock counts are only rough ballpark figures.

One challenge with all the approaches, but especially the fastest approaches, is that SPI comes in four flavors. First is the clock polarity. Mode 3 and Mode 1 work the same, except with an inverted clock. Mode 0 and Mode 2 work the same, except with an inverted clock. Modes 0 and 1 start and end their cycle of sending data (a byte, normally) with the clock low, while Modes 2 and 3 start and end their cycle of sending data with the clock high.

The thing is, to cope with a device with the "wrong clock polarity", you can just run the serial clock line through an inverter, bring out both the original serial clock and the inverted serial clock, and connect the one that gives the desired clock polarity. That can be done with an inverting line driver or by sending the same signal through both inputs of a two-input NAND or NOR gate. So polarity is not something that would actually have to be dealt with in software, if we are thinking about some interface board being developed to allow talking to some SPI peripheral chip.

Requiring more thought is the clock phase.

SPI works by having the master pull down a select line, then driving a clock line while it sends output to the servant device on the Master Out, Servant In line (MOSI, often pronounced "moe see"), and also receiving input from the servant device on the Master In, Servant Out line (MISO, sometimes pronounced "my so" and sometimes pronounced "me so"). So to talk to only one device, only four lines are required. Adding a device only requires adding one more select line, since the clock, MOSI and MISO lines can work as a serial bus.

Any synchronous serial connection can work in one of two ways. You can assert the data and then toggle the clock, using that to latch the data, then toggle the clock back, using that to shift the data. Or you can toggle the clock, using that to shift the data and assert the data, then toggle the clock back, using that to latch the data.

What's going on is that if you want this kind of simultaneous swap of data using a serial shift register at the servant, you need an extra carry bit. However, it can hold either the bit being shifted in, or the bit being shifted out. If it is holding the bit being shifted in, you connect MOSI to the carry and MISO directly to bit7 of the servant Serial Shift Register (SSR). The data is on the SPI bus if the device is selected, so toggle the clock to latch it, then toggle it back to do the shift, with the contents of the carry register going into bit0 and the old bit7 being over-written.

If the carry register is connected to the bit being shifted out, you connect MOSI directly to bit0 of the servant SSR, and connect MISO to the carry register. Now, the data in bit0 of the servant is in the way of the incoming data, so you shift the data left by toggling the clock, which opens up a space in bit0 of the servant register and puts the prior bit7 contents into the carry register, and then you latch the data on both sides by toggling the clock back to where it started.

Basically, there are 16 different ways to accomplish the kind of transfer that SPI is doing: select by going high or going low, shift right or shift left, start the clock high or low, and latch first or shift first. All that SPI has standardized is select by going low and shifting left, leaving four different ways that different peripheral devices might implement their SPI. So the clock polarity is given a value of 0 if it starts low, 2 if it starts high, the clock phase is given a value of 0 if it latches first, 1 if it shifts first, and you add them together to get the "Mode" of the device: Mode0 through Mode3. The SPI required to be supported by SD cards is the start low, latch first kind of SPI, so it's Mode0. The VIA SSR is a start high, shift first kind of shift register, so the most natural mode for the CX16 is Mode3. And while inverting clock polarity is easy to do in hardware, so connecting a Mode1 servant to a Mode3 master is no issue, connecting a Mode0 or Mode2 servant to a Mode3 Master is trickier.
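The mode numbering described above can be written down as a short sketch. Python is used here only as a calculator, and the function names are illustrative rather than taken from any library.

```python
def spi_mode(cpol: int, cpha: int) -> int:
    """Mode number as described in the post.

    cpol: 0 = clock starts low, 1 = clock starts high (worth 2 in the mode)
    cpha: 0 = latch first,      1 = shift first       (worth 1 in the mode)
    """
    if cpol not in (0, 1) or cpha not in (0, 1):
        raise ValueError("cpol and cpha must each be 0 or 1")
    return 2 * cpol + cpha


def inverter_is_enough(master_mode: int, servant_mode: int) -> bool:
    """True when the two modes share a clock phase, so that inverting the
    serial clock in hardware is all that is needed to connect them."""
    return (master_mode & 1) == (servant_mode & 1)


if __name__ == "__main__":
    for servant in range(4):
        print("Mode3 master -> Mode%d servant: inverter enough? %s"
              % (servant, inverter_is_enough(3, servant)))
```

With a Mode3 master, only the Mode3 and Mode1 servants pass the phase check, which is why the Mode0 and Mode2 cases need the extra attention discussed next.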

I don't know enough about electronic circuits to know how to get around that problem for the fastest User Port approach I could come up with, which relies on a VIA serial shift register configuration that drives the serial shift register at half the system clock speed. But for slower approaches, which drive things a bit at a time, I think I've worked out how to handle it. The thing is, when you swap BOTH phase AND polarity, between the VIA "Native" Mode 3 and the SD card Native Mode 0, they both latch data on the rising clock edge and shift data on the falling clock edge. The only thing is that the Mode3 needs an "extra" half clock at the start, to shift out the first data, where Mode0 starts with the first bits of data already asserted. And Mode0 needs an "extra" half clock at the end, so the servant will shift the last bit of MOSI data in.

If generating the clock in software, that is not hard to handle. However, even with VIA hardware generating individual clock pulses, it is still straightforward to do. With literally a single logic gate connected to a GPIO, you can set up an SPI bus that allows a Mode3 Master to imitate a Mode0 master on the SPI Bus. This relies on an extra output pin that masks out the Master's Serial Clock and holds the SPI bus clock low. When accessing a Mode3 Servant, the mask is left off. When accessing a Mode0 servant, the SPI bus clock line stays low until the mask is released, and the SPI clock goes high, so the servant device "doesn't know" that the Master's serial clock did an entire cycle, and instead sees the release of the mask as the first half of a cycle. At the end of the process, for a Mode0 Servant, the mask is pulled down into place again, and that gives the last half cycle the Mode0 servant needs to latch its data.
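Here is a small simulation of that masking idea, as a sketch only. It assumes the mask is a two-input AND of the master's clock and a mask pin, that the master's first pulse completes before the mask is released, and that the mask is reapplied after the last pulse. The waveforms are lists of 0/1 samples and all names are made up for illustration.

```python
def bus_clock(mode0_servant: bool, n_bits: int = 8) -> list:
    """Return the clock seen on the SPI bus for one transfer from a
    Mode3 master (clock idles high, n_bits low-then-high pulses)."""
    pulse = [0, 1]
    if mode0_servant:
        # Idle, first pulse hidden by the mask, release while the clock is
        # high, remaining pulses pass through, then the mask is reapplied.
        sclk = [1] + pulse + [1] + pulse * (n_bits - 1) + [1]
        mask = [0] + [0, 0] + [1] + [1, 1] * (n_bits - 1) + [0]
    else:
        sclk = [1] + pulse * n_bits
        mask = [1] * len(sclk)        # mask left off for a Mode3 servant
    return [s & m for s, m in zip(sclk, mask)]


def count_edges(wave: list) -> tuple:
    """Return (rising, falling) edge counts of a sampled waveform."""
    pairs = list(zip(wave, wave[1:]))
    rising = sum(1 for a, b in pairs if (a, b) == (0, 1))
    falling = sum(1 for a, b in pairs if (a, b) == (1, 0))
    return rising, falling


if __name__ == "__main__":
    for mode0 in (False, True):
        wave = bus_clock(mode0)
        print("Mode0 servant" if mode0 else "Mode3 servant",
              wave, count_edges(wave))
```

Under these assumptions the Mode0 servant sees a clock that starts low, rises first, and ends low, with eight rising and eight falling edges, while the Mode3 servant sees the master's clock unchanged.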

So that was the idea. How did it go? This also gives a baseline idea about how slow it might be.

The first sketch is about what can fit into the resources now available on VIA1. In the current version of the board, all spots on VIA1 are in use by the system, but removing the PS/2 interface to an MCU frees up four VIA pins. Those pins could be brought out to a block header, alongside the CA1 and CA2 handshake lines that the system does not use. And that is enough to provide a "bit banged" SPI interface: SCLK, MISO, MOSI, Select. Indeed, it is enough to support a generic SPI interface that can be fanned out to support multiple SPI devices, because CA2 can be set to be low or high, so it works as an additional output line. A second select plus two glue logic chips on the other side of the interface (a "NOR" gate and a Serial Shift Register) supports using the SPI bus to write a select byte for up to eight devices into the serial shift register when CA2 is selected, and then wiring the primary Select line from Port A or B to the serial shift register Output Enable, so the loaded selection from up to eight choices is asserted by a single GPIO.

In data port A, pin 0 is the serial clock, pin 1 is Master Out, Servant in, in Port B pin 0 pin 7 is Master In, Servant Out, and pin 6 Select. CA2 is alternative select, and CA1 is available for an /Alert signal for those SPI devices that have an extra pin to send an interrupt or error warning. No actual serial clock mask is needed for the serial clock line, since the serial clock line is being driven entirely from software.

In the words of Crocodile Dundee, "you can live on it, but it tastes like sh*t". There's no guarantee that I have the fastest possible code, but I think it's pretty close, and I get a routine to send/receive a byte at 311 clocks in Mode0, 306 clocks in Mode3, which is an effective maximum bandwidth of around 200Kbps, or just a touch faster than the CX16 I2C bus -- from 14%-33% faster, depending on where it lands in the 150kbps-175kbps range for stable operation.

Now, suppose that you bought that expansion User Port card. In that case, you have all of a VIA to work with, so it's possible to "build a better bit-banger" than this. The next sketch builds the input byte "in place" by setting Port B pin0 to input, connected to MISO, and the rest of the PortB pins to output, not connected but acting as bit registers. The output byte is shifted left to put the first output bit into the carry flag, and the rest stored in PortB, so the input is shifted into place at the same time that the output is shifted to the carry flag. MOSI, Select and Alert are in PortA along with 5 device selects. The Device Select routine creates complete Bit=0 and Bit=1 states for Port A, including the correct select flag and the clock low, so storing the value of the MOSI bit also starts the clock cycle. The clock pin is PortA pin0, so an "INC PORTA" instruction toggles the serial clock line back up again. The Bit0 and Bit1 states for PortA are loaded into registers A and X at the beginning of the routine and Y is used for the loop index.

This is a bit better, with around 229 or 234 clock cycles per byte, depending on Mode, so between 34KB and 35KB per second, or between 270 and 280 Kbps throughput ... now 50% faster than CX16 I2C. The wiring is every bit as easy as the PortA-only bit banged approach, just connect MISO, MOSI, SCLK and the Selects to the correct pins on the UserPort header.

Next faster is to have the VIA generate the Serial Clock itself using the handshake lines of Port A. You can put PortA into a mode where it puts a pulse out when Port A is written to. So this approach dedicates PortA to Master Out, Servant In, with Pin7 connected to MOSI, with the bits shifted out by simply shifting them, "ASL CIA2_DataA". The shift works because the 65C02 processor does a shift in memory by reading the RAM, then doing a redundant read of the RAM while shifting the data, then writing the RAM. The NMOS 6502, or 6510 in the C64, does a write for the second clock cycle when the shift is actually being performed, so it would trigger the pulse twice for each shift. MISO is connected to Port B Pin7, which is shifted out into the carry. Port B also holds two select lines and the serial clock "mask" line in pairs of pins, which answers the question of how you shift the MISO data out without messing up the status of the other pins in the register. Pins 6+5, 3+4 and 1+2 are wired together, with pins 6, 4 and 2 being set up as outputs, and 5, 3 and 1 being set up as inputs. When the data is shifted, the input pins don't care, directly, but their neighbor that has data written to it is reflected in the input pin. Then when the shift instruction picks up the data in the Read cycle of Read-Modify-Write, the shift puts the correct value back into the output pin, and then the wire brings that value back to the input pin.
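The paired-pin trick is easier to see in a small model. The following sketch assumes a read of Port B returns the output register bit for output pins and the pin level for input pins, and that pin 0 reads as 0 (the post does not say what pin 0 does here). The function names are invented for the illustration.

```python
OUTPUT_PINS = (6, 4, 2)              # pins configured as outputs
PAIRED_INPUT = {5: 6, 3: 4, 1: 2}    # input pin -> output pin wired to it


def read_port_b(out_reg: int, miso: int) -> int:
    """Value the CPU sees when it reads Port B."""
    value = (miso & 1) << 7                      # pin 7 is the MISO input
    for pin in OUTPUT_PINS:
        value |= ((out_reg >> pin) & 1) << pin   # outputs read back
    for in_pin, out_pin in PAIRED_INPUT.items():
        value |= ((out_reg >> out_pin) & 1) << in_pin  # wire copies neighbor
    return value


def asl_port_b(out_reg: int, miso: int) -> tuple:
    """Model 'ASL PortB': read, shift left, write back.
    Returns (new output register, carry flag)."""
    value = read_port_b(out_reg, miso)
    carry = value >> 7
    return (value << 1) & 0xFF, carry


if __name__ == "__main__":
    out = 0b01000100                 # pins 6 and 2 high, pin 4 low
    new_out, carry = asl_port_b(out, miso=1)
    print(bin(out), "->", bin(new_out), "carry =", carry)
```

In this model the carry always ends up holding MISO, and the three output pins hold their values across the shift because each one is refilled from the input pin below it.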

Finally, the SPI bus clock mask to support Mode0/Mode2 devices can be provided with a single two-input AND gate: with one input connected to the Mode3 SCLK and the other connected to Pin6 of PortB, it works to mask the serial clock line. An alternative is to use a NOR gate. That inverts the clock, but running the inverted clock through both inputs of another NOR gate will bring it right way up again, so the hardware to handle the Mode0, Mode1 and Mode2 translations is a single quad two-input NOR gate, with two of the gates still free.
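The gate logic can be checked with a truth table. This sketch makes one assumption the post does not pin down: the gate order in the NOR version is inverter first, then the masking NOR, which holds the bus clock low while masked. In that arrangement the mask pin is active high, the opposite sense to the AND version.

```python
def nor(a: int, b: int) -> int:
    """Two-input NOR gate on 0/1 values."""
    return 1 - (a | b)


def and_mask(sclk: int, enable: int) -> int:
    """AND version: the clock passes while enable is 1, else held low."""
    return sclk & enable


def nor_mask(sclk: int, hold: int) -> int:
    """NOR version (assumed gate order): invert the clock with a NOR whose
    inputs are tied together, then NOR the result with the hold line."""
    inverted = nor(sclk, sclk)
    return nor(inverted, hold)


if __name__ == "__main__":
    print("sclk hold -> and_mask(sclk, not hold)  nor_mask(sclk, hold)")
    for sclk in (0, 1):
        for hold in (0, 1):
            print(sclk, hold, "->", and_mask(sclk, 1 - hold),
                  nor_mask(sclk, hold))
```

Both columns agree for every input combination, so two NOR gates from one quad package can stand in for the AND gate as long as the software drives the mask pin with the opposite sense.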

Anyway, after the first bit is managed "by hand" to handle the store to PortA rather than shifting it, and "unmasking" of the serial, the following bits are "ASL PortA : ASL PortB : ROL", which speeds things up a bit. If I "unroll the loop" by putting that three instruction sequence into the code seven times, I get about 195 clocks for Mode3, 179 clocks for Mode0, which at 8MHz is about 325-350 Kbps. So offloading the work of bit banging the Serial Clock doesn't quite get us up to I2C fast speeds, ... but it's 85%-100% faster than CX16 I2C.

But in any event, all three of these are competing against the I2C routine with one hand tied behind their back, since they aren't actually making use of the serial shift register built into the VIA. So the fourth approach uses the VIA SSR as MISO. It puts the Serial Shift register into the mode where it is driven by an external serial clock, and uses the PortA handshake line to generate that serial clock. That is the approach that runs Mode0 in 99 clocks for a byte and Mode3 in 89 clocks for a byte, or roughly 640-720Kbps bandwidth, in the neighborhood of four times faster than CX16 I2C.
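For reference, the clocks-per-byte figures quoted so far convert to throughput as follows. This is only arithmetic on the numbers in this post, assuming an 8MHz clock, back-to-back byte transfers, and no allowance for select or calling overhead. The labels are informal descriptions, not routine names.

```python
CPU_HZ = 8_000_000   # CX16 system clock assumed in the post


def kbps(clocks_per_byte: int) -> float:
    """Kilobits per second for back-to-back byte transfers."""
    return CPU_HZ * 8 / clocks_per_byte / 1000


def kbytes_per_second(clocks_per_byte: int) -> float:
    """Kilobytes (1000 bytes) per second for back-to-back transfers."""
    return CPU_HZ / clocks_per_byte / 1000


# (fewest clocks, most clocks) per byte, as quoted for each sketch
SKETCHES = {
    "four free pins, bit banged": (306, 311),
    "input built in place in Port B": (229, 234),
    "handshake pulse as serial clock": (179, 195),
    "VIA SSR as MISO": (89, 99),
}

if __name__ == "__main__":
    for name, (fast, slow) in SKETCHES.items():
        print("%-32s %4.0f-%4.0f kbps  %5.1f-%5.1f KB/s" % (
            name, kbps(slow), kbps(fast),
            kbytes_per_second(slow), kbytes_per_second(fast)))
```

The same two functions can be pointed at any other clock count to see where a revised routine would land relative to the 150-175kbps reported for I2C.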

One quirk of that approach is that SPI works with the Most Significant Bit first, while the VIA SSR is Least Significant Bit first. So part of the fourth approach is a 256 byte table that gives the mirror image of the MISO byte in the shift register ... because it will finish the eight SPI serial cycles with Bit7 where Bit0 is supposed to be and Bit0 where Bit7 is supposed to be.
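A mirror-image table like that is easy to generate offline. The sketch below builds the 256 entries and prints them as `.byte` lines. The `.byte` directive and `$` hex prefix are assumptions about the assembler, so adjust them to whatever syntax is in use.

```python
def reverse_bits(value: int) -> int:
    """Return the 8-bit mirror image of value (bit7 <-> bit0, and so on)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Index with the byte read from the shift register to get the SPI byte.
MIRROR = bytes(reverse_bits(i) for i in range(256))


def as_byte_lines(table: bytes, per_line: int = 8) -> list:
    """Format the table as assembler data lines (assumed '.byte' syntax)."""
    lines = []
    for start in range(0, len(table), per_line):
        chunk = table[start:start + per_line]
        lines.append(".byte " + ",".join("$%02X" % b for b in chunk))
    return lines


if __name__ == "__main__":
    print("\n".join(as_byte_lines(MIRROR)))
```

The table costs one page of memory, and since mirroring twice gives back the original byte, the same table serves for bytes going in either direction.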

Now, it can get faster. You can add an external serial shift register, and use the "byte single shot" mode of the built in VIA SSR to put out a byte within 20 clocks of being stored to the VIA serial shift register ... getting throughput as high as 1Mbps (over 128KB/s), like one of the newer higher speed variants of the I2C bus. But what I'd want is a four mode SPI interface ... and I can only see how to make that work for a Mode3/Mode1 interface. I don't know enough about hardware circuitry to build the circuit to pin the mask down, then release the pin but continue to hold the mask down until the native SCLK line goes high and then release it. I know there is some combination of bit latches and flip flops that can do that, but I don't know exactly what combination that would be.

Still, for most of what I would want to talk to through the interface (up to and including cheating and using an RPi W to copy files from my PC to my Commander X16), I think that 80KB/s might well do me. Plus, being able to support all four modes with two two-input NOR gates is really appealing, since the standard "74" glue logic comes with four two-input gates per chip.

SPI_VIA_Version2.asm

Wavicle · Post by **Wavicle** » Tue Sep 20, 2022 7:30 am

Just wanted to clear up the I2C speed question - using nearly every trick available, I2C is running stable at 150-175KHz. It might be possible to run it faster, but not by much. It does support clock stretching so slower I2C devices that can't go over 100KHz shouldn't have a problem (as long as they apply back-pressure by stretching the clock).

Wavicle · Post by **Wavicle** » Tue Sep 20, 2022 7:35 am

I should also add that both Kevin and I are leaning towards making VIA2 optional using a header on our respective boards. There may be enough pins on VIA1 now to do SPI there. I briefly mentioned this to Kevin, but he is generally cautious about adding new functionality.

BruceMcF · Post by **BruceMcF** » Tue Sep 20, 2022 12:14 pm

On 9/20/2022 at 3:35 AM, Wavicle said:

I should also add that both Kevin and I are leaning towards making VIA2 optional using a header on our respective boards. There may be enough pins on VIA1 now to do SPI there. I briefly mentioned this to Kevin, but he is generally cautious about adding new functionality.

Wait, what? No User Port?

There are enough pins on VIA1 to do SPI by pure bit-banging, but it would of course be slower. Indeed, while four more GPIO on VIA1 would be handy in a number of User Port applications, on their own a slow SPI bus is about all they would be useful for.

To support a slow SPI bus on VIA1, you'd want to move NES Latch and NES Clock to PortB, so that the four NES lines in PortA are all input lines. Then you can bit bang PA.0 as SCLK, PA.1 as MISO, PA.2 as MOSI, and PA.3 as /Select, with CA2 as an extra select output line via the PCR register at BASE+$0C =%xxxx11bx, allowing either 2 selects if used directly, or 3 if fed through a 2-4 active low decoder with one line unused.
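The %xxxx11bx pattern means PCR bits 3 and 2 are set, bit 1 carries the CA2 level, and the remaining bits are left as they were. A small sketch of how that value would be computed, with a made-up function name and Python standing in for the read-modify-write the 65C02 would do:

```python
PCR_CA2_MASK = 0b00001110    # PCR bits 3..1 control CA2


def pcr_with_ca2_output(pcr: int, level: int) -> int:
    """Return the PCR value with CA2 set to manual output at the given
    level (0 or 1), keeping the CA1 bit and the CB1/CB2 bits unchanged."""
    kept = pcr & ~PCR_CA2_MASK & 0xFF
    return kept | 0b00001100 | ((level & 1) << 1)


if __name__ == "__main__":
    for old in (0x00, 0xF1):
        for level in (0, 1):
            print("PCR $%02X, CA2=%d -> $%02X"
                  % (old, level, pcr_with_ca2_output(old, level)))
```

On the real hardware the equivalent would be a load of the PCR, an AND and OR with these masks, and a store back, so that the other handshake settings are not disturbed.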

The main bit transfer cycle would be something like the following, where:

"MOSI" is a zero page location shifting the output byte out and the input byte in

CPOL is a general location storing the clock polarity, $00 or $01

MOSI0 is a zero page location with the byte to store to PortA for a zero bit

MOSI1 is a zero page location with the byte to store to PortA for a one bit

The Select routine and the end of this routine imply the Serial Clock starts high if this is a Mode3 transfer, or low if it is a Mode0 transfer

SPI_BYTE: ; Called with SPI output byte in A, returns with SPI input byte in A, uses X, Y.

ROL : STA MOSI : LDX MOSI1 : LDY #7

- LDA MOSI0 : BCC + : TXA : + STA VIA1_PA : INC VIA1_PA : LDA VIA1_PA : LSR : LSR : ROL MOSI : DEY : BNE -

LDA MOSI : ROL : LDX CPOL : BNE + : DEC VIA1_PA : + RTS

Clocks ~= 10+[7*35-1]+12+11 = 277 clocks for Mode3, 283 clocks for Mode0, 220-230 Kbps

So long as you have 2 select GPIO available, with a little more work you can get a generic number of selects available, since you can use an AND gate with the SERIAL clock and one of the selects to drive the clock of a tri-state parallel out Serial Shift register when the select is High, and load a device select byte into the SSR, then pull that select low and have the other select be tied to the OutputEnable on the serial shift register. That would require a new SPI_SELECT subroutine, storing the last select byte to work out whether it is necessary to write a new value into the Select SSR ... but the SPI_BYTE subroutine would not change.

Wavicle · Post by **Wavicle** » Tue Sep 20, 2022 3:29 pm

On 9/20/2022 at 5:14 AM, BruceMcF said:

Wait, what? No User Port?

It's under consideration. Nothing definitive yet, but if removed from the base design, there will be a dedicated header for those who would like to add it back in. Its location in IO space will not change.

BruceMcF · Post by **BruceMcF** » Tue Sep 20, 2022 3:55 pm

On 9/20/2022 at 11:29 AM, Wavicle said:

It's under consideration. Nothing definitive yet, but if removed from the base design, there will be a dedicated header for those who would like to add it back in. Its location in IO space will not change.

A header or a socket?

I mean, I could live with an empty socket, for those who don't want to make use of the User Port. The kind of people who are going to use the User Port are also not likely to complain TOO much about having to buy a VIA and install it for their first User Port project.

But the two big appeals of the User Port over an extension board for the CX16 (or a cartridge for the C64) are what make projects based on it easy to do. First, the interface TO the user port is standard. Whether it's a card edge or a block header is neither here nor there to me, but having a block header with the VIA lines at set locations avoids the headache of project #1 wiring the VIA this way and project #2 wiring the VIA that way. And second, it's on the "low frequency" side of the VIA, so the frequency domain can never get higher than PHI2/2 and can be as slow as the driving software lets it be.

It's a stable platform for hardware projects in the way that a stable Kernel API is a stable platform for software projects, and not having to hit PHI2 bus timings makes it more robust for beginners to hardware projects.

Shuffle the NES Latch and Clock to VIA1.PB0 and VIA1.PB1, to consolidate the "free" GPIO into VIA1.PA0-VIA1.PA3, put a header with +5VCC, GND, VIA1.PA0-PA3, VIA1.CA1/2, and any VIA2 lines not used in a parallel port interface (I forget ... maybe VIA2.PB6/7, and VIA2.CB1/2), and another header with the PC parallel port layout from VIA2.PA0-7, VIA2.PB0-?, CA1/2.

Then there is a "teaser" User Port that can be used for a limited range of projects, and installing a VIA2 in the socket gives a standard parallel port and bit-banged serial port on two separate block headers, or an ability to use two ribbon cables to plug into a breadboard that has 20 GPIO and 3 pairs of handshake lines, including one pair that can connect to a shift register.

rje · Post by **rje** » Tue Sep 20, 2022 5:46 pm

I have little to contribute, except that I have worked with I2C on the RPi, gathering data from an ultrasonic motion detector, and a range finder. I found I2C on the RPi to be easy to work with, enabling fun.

I've *seen* the SPI pins on the Pi as well, and I just assume it is also easy to work with.

Because I'm not a hardware guy, I look for Ease Of Use when it comes to hardware projects. I can solder a header onto a Pi, if I'm careful. That's about my limit.

BruceMcF · Post by **BruceMcF** » Tue Sep 20, 2022 7:48 pm

On 9/20/2022 at 1:46 PM, rje said:

I have little to contribute, except that I have worked with I2C on the RPi, gathering data from an ultrasonic motion detector, and a range finder. I found I2C on the RPi to be easy to work with, enabling fun.

I've *seen* the SPI pins on the Pi as well, and I just assume it is also easy to work with.

Because I'm not a hardware guy, I look for Ease Of Use when it comes to hardware projects. I can solder a header onto a Pi, if I'm careful. That's about my limit.

AFAIU, I2C is easy to use if it's built into the thing you are using, and it is the most economical in pins of the general purpose serial buses ... but it's easier for SPI to go fast. Note what Wavicle reports about I2C speed being around 150-175kbps, while I think (as shown above) that by being more careful in the programming of the bit banged interface, even a bit banged SPI can go moderately faster, and if the VIA Serial Shift Register can be put to use, it could well be 3-4 times faster.

So for talking to a keyboard or a mouse, I2C is just fine. For talking to a RPi Nano acting as a USB Flash drive bridge, I'd rather use SPI, and then rather have enough of the VIA resources to give it 80KB/sec bandwidth.

BruceMcF · Post by **BruceMcF** » Wed Sep 21, 2022 12:53 pm

On 9/20/2022 at 3:35 AM, Wavicle said:

I should also add that both Kevin and I are leaning towards making VIA2 optional using a header on our respective boards. There may be enough pins on VIA1 now to do SPI there. I briefly mentioned this to Kevin, but he is generally cautious about adding new functionality.

____________________________

[NOTE: I have edited the original post to reflect the information that has come to light in this discussion -- and also the improved "four line bit banger" routine that I came up with along the way.]

____________________________

What Dave said on Facebook was an expansion card. If the expansion card had one VIA soldered in with Port A and Port B block headers with power in each, and a socket for a second VIA with two more Port block headers ... that would be partial compensation for losing VIA2 from the motherboard.

An expansion card available from the project itself is better than a block header on the motherboard, because one of the big appeals of a User Port over the expansion bus is that it's a stable base for the hardware projects that attach to it. The block header is going to lead to a lot of different things that plug into it, and being after-market hardware they won't all be compatible with each other, splintering hardware project designs that rely on the thing plugged into the block header.

If there is a User Port expansion card that is available for pre-order alongside the CX16 itself, that pre-empts the after-market reinventing the wheel for adding one or two VIAs to the CX16 system.

Wavicle · Post by **Wavicle** » Wed Sep 21, 2022 4:28 pm

Kevin posted an update to Facebook on the topic of 6522s and user port: Commander X16™ Prototype | Facebook