User Port Dreams: SPI
Posted: Mon Sep 19, 2022 12:37 am
[Edited to reflect the news from David on Facebook about VIA2 being an optional expansion card rather than built into the board.]
So, we are supposedly getting I2C. I am told (see below) that the CX16 at 8MHz, running its I2C without clock stretching, gets about 150kbps-175kbps, or 18-22KB/s (raw throughput, not counting bus management and device command overheads).
And while I2C has been around for a long time, there is another fairly widely used communications protocol to talk to things like sensors and LCD panels and Real Time Clocks, which is the Serial Peripheral Interface, SPI. SPI has been used for decades now to let micro-controllers talk to all sorts of peripherals at speeds appropriate to the application. Plus, for anything where those not in the project might wish to "cheat" and have an RPi Nano or RPi Zero do the work, SPI is a very convenient way to talk to a Raspberry Pi board, typically substantially faster than RS232 serial.
So the I2C -- which I now know is running at around 150-175 kbps -- is what I was holding in mind as the built-in alternative to SPI over the User Port.... especially since I2C is more widely supported nowadays in most areas.
So I did some sketches. None of the code is tested, not least because none of the hardware it would run on exists yet, so the clock counts are only rough ballpark figures.
One challenge with all the approaches, but especially the fastest approaches, is that SPI comes in four flavors. First is the clock polarity. Mode 3 and Mode 1 work the same, except with an inverted clock. Mode 0 and Mode 2 work the same, except with an inverted clock. In each pair, the lower-numbered mode (Mode 0 or Mode 1) starts and ends its cycle of sending data (a byte, normally) with the clock low, while the higher-numbered mode (Mode 2 or Mode 3) starts and ends its cycle of sending data with the clock high.
The thing is, to cope with a device with the "wrong clock polarity", you can just run the serial clock line through an inverter and bring out both the original serial clock and the inverted serial clock, and connect the one that gives the desired clock polarity. That can be done with an inverting line driver or by sending the same signal through both inputs of a two-input NAND or NOR gate, so that is not something that would actually have to be dealt with in software if we are thinking about some interface board being developed to allow talking to some SPI peripheral chip.
Requiring more thought is the clock phase.
SPI works by having the master pull down a select line, then driving a clock line while it sends output to the servant device on the Master Out, Servant In line (MOSI, often pronounced "moe see"), and also receiving input from the servant device on the Master In, Servant Out line (MISO, sometimes pronounced "my so" and sometimes pronounced "me so"). So to talk to only one device, only four lines are required. Adding a device only requires adding one more select line, since the clock, MOSI and MISO lines can work as a serial bus.
Any synchronous serial connection can work in one of two ways. You can assert the data and then toggle the clock, using that to latch the data, then toggle the clock back, using that to shift the data. Or you can toggle the clock, using that to shift the data and assert the data, then toggle the clock back, using that to latch the data.
What's going on is that if you want this kind of simultaneous swap of data using a serial shift register at the servant, you need an extra carry bit. However, it can hold either the bit being shifted in, or the bit being shifted out. If it is holding the bit being shifted in, you connect MOSI to the carry and MISO directly to bit7 of the servant Serial Shift Register (SSR). The data is on the SPI bus if the device is selected, so toggle the clock to latch it, then toggle it back to do the shift, with the contents of the carry register going into bit0 and the old bit7 being over-written.
If the carry register is connected to the bit being shifted out, you connect MOSI directly to bit0 of the servant SSR, and connect MISO to the carry register. Now, the data in bit0 of the servant is in the way of the incoming data, so you shift the data left by toggling the clock, which opens up a space in bit0 of the servant register and puts the prior bit7 contents into the carry register, and then you latch the data on both sides by toggling the clock back to where it started.
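The latch-then-shift swap described above can be modeled in a few lines of Python. This is only a sketch of the logic, not code for the CX16: the function name spi_exchange and the idea of giving both sides a one-bit carry are assumptions made for illustration.

```python
def spi_exchange(master_byte, servant_byte):
    """Model an 8-bit full-duplex swap, MSB first, latch first then shift.

    Hypothetical helper: each side has an 8-bit shift register plus a
    one-bit carry that holds the incoming bit between the two clock edges.
    """
    master = master_byte & 0xFF
    servant = servant_byte & 0xFF
    for _ in range(8):
        # Data is already on the bus: bit7 of each register drives its line.
        mosi = (master >> 7) & 1
        miso = (servant >> 7) & 1
        # First clock edge: each side latches the other side's bit.
        master_carry = miso
        servant_carry = mosi
        # Second clock edge: shift left, carry drops into bit0, old bit7 is lost.
        master = ((master << 1) & 0xFF) | master_carry
        servant = ((servant << 1) & 0xFF) | servant_carry
    return master, servant


if __name__ == "__main__":
    m, s = spi_exchange(0xA5, 0x3C)
    print(f"master now holds ${m:02X}, servant now holds ${s:02X}")
```

After eight clock cycles the two registers have traded contents, which is the whole point of the extra carry bit.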
Basically, there are 16 different ways to accomplish the kind of transfer that SPI is doing: select by going high or going low, shift right or shift left, start the clock high or low, and latch first or shift first, and all that SPI has standardized is select by going low and shifting left, leaving four different ways that different peripheral devices might implement their SPI. So the clock polarity is given a value of 0 if it starts low, 2 if it starts high, the clock phase is given a value of 0 if it latches first, 1 if it shifts first, and you add them together to get the "Mode" of the device: Mode0 through Mode3. The SPI required to be supported by SD cards is the start low, latch first kind of SPI, so it's Mode0. The VIA SSR is a start high, shift first kind of shift register, so the most natural mode for the CX16 is Mode3. And while inverting clock polarity is easy to do in hardware, so connecting a Mode1 servant to a Mode3 master is no issue, connecting a Mode0 or Mode2 servant to a Mode3 Master is trickier.
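As a quick check of the numbering, here is a small Python sketch of the "polarity plus phase" sum. The function names are made up for illustration.

```python
def spi_mode(starts_high, shifts_first):
    """Mode number = polarity value (0 or 2) + phase value (0 or 1)."""
    polarity = 2 if starts_high else 0
    phase = 1 if shifts_first else 0
    return polarity + phase


def inverter_is_enough(mode_a, mode_b):
    """True when two modes differ only in clock polarity (same phase)."""
    return (mode_a ^ mode_b) == 2


if __name__ == "__main__":
    sd_card = spi_mode(starts_high=False, shifts_first=False)  # start low, latch first
    via_ssr = spi_mode(starts_high=True, shifts_first=True)    # start high, shift first
    print(sd_card, via_ssr)
    print(inverter_is_enough(1, via_ssr), inverter_is_enough(sd_card, via_ssr))
```

The second helper just restates the point in the text: Mode1 and Mode3 differ only in polarity, so a hardware inverter bridges them, while Mode0 and Mode3 differ in both bits.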
I don't know enough about electronic circuits to know how to get around that problem for the fastest User Port approach I could come up with, which relies on a VIA serial shift register configuration that drives the serial shift register at half the system clock speed. But for slower approaches, which drive things a bit at a time, I think I've worked out how to handle it. The thing is, when you swap BOTH phase AND polarity, between the VIA "Native" Mode 3 and the SD card Native Mode 0, they both latch data on the rising clock edge and shift data on the falling clock edge. The only thing is that the Mode3 needs an "extra" half clock at the start, to shift out the first data, where Mode0 starts with the first bits of data already asserted. And Mode0 needs an "extra" half clock at the end, so the servant will shift the last bit of MOSI data in.
If generating the clock in software, that is not hard to handle. However, even with VIA hardware generating individual clock pulses, it is still straightforward to do. With literally a single logic gate connected to a GPIO, you can set up an SPI bus that allows a Mode3 Master to imitate a Mode0 Master on the SPI bus. This relies on an extra output pin that masks out the Master's serial clock and holds the SPI bus clock low. When accessing a Mode3 servant, the mask is left off. When accessing a Mode0 servant, the SPI bus clock line stays low until the mask is released, at which point the SPI bus clock goes high, so the servant device "doesn't know" that the Master's serial clock did an entire cycle, and instead sees the release of the mask as the first half of a cycle. At the end of the process, for a Mode0 servant, the mask is pulled down into place again, and that gives the last half cycle the Mode0 servant needs to latch its data.
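To convince myself the masking idea hangs together, here is a rough Python model of it. It assumes the mask is a plain AND of the Master's clock and the mask pin, and that the mask is released right after the Master's first (hidden) clock cycle; the function names and the half-cycle-per-list-entry timing are inventions for this sketch, not anything measured.

```python
def master_mode3_clock(bits=8):
    """Mode3 Master clock as a list of levels: idle high, then low/high per bit."""
    levels = [1]
    for _ in range(bits):
        levels += [0, 1]
    return levels


def bus_clock_for_mode3_servant(bits=8):
    """Mask left off (held at 1), so the bus clock is just the Master clock."""
    return [sclk & 1 for sclk in master_mode3_clock(bits)]


def bus_clock_for_mode0_servant(bits=8):
    """Mask hides the Master's first cycle, then supplies the final half cycle."""
    sclk = master_mode3_clock(bits)
    steps = [(level, 0) for level in sclk[:3]]   # idle plus first cycle, masked
    steps.append((1, 1))                         # mask released: bus clock rises
    steps += [(level, 1) for level in sclk[3:]]  # remaining cycles pass through
    steps.append((1, 0))                         # mask pulled down: final falling edge
    return [level & mask for level, mask in steps]


def count_edges(levels):
    """Return (rising, falling) edge counts for a list of clock levels."""
    pairs = list(zip(levels, levels[1:]))
    rising = sum(1 for a, b in pairs if (a, b) == (0, 1))
    falling = sum(1 for a, b in pairs if (a, b) == (1, 0))
    return rising, falling


if __name__ == "__main__":
    print(count_edges(bus_clock_for_mode3_servant()))
    print(count_edges(bus_clock_for_mode0_servant()))
```

In this model both servants see eight rising and eight falling edges per byte; the Mode3 servant sees a clock that starts and ends high, and the Mode0 servant sees one that starts and ends low.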
So that was the idea. How did it go? This also gives a baseline idea about how slow it might be.
The first sketch is about what can fit into the resources now available on VIA1. In the current version of the board, all spots on VIA1 are in use by the system, but moving the PS/2 interface to an MCU frees up four VIA pins. Those pins could be brought out to a block header, alongside the CA1 and CA2 handshake lines that the system does not use. And that is enough to provide a "bit banged" SPI interface: SCLK, MISO, MOSI, Select. Indeed, it is enough to support a generic SPI interface that can be fanned out to support multiple SPI devices, because CA2 can be set to be low or high, so it works as an additional output line. And a second select plus two glue logic chips on the other side of the interface (a "NOR" gate and a Serial Shift Register) supports using the SPI bus to write a select byte for up to eight devices into the serial shift register when CA2 is selected, with the primary Select line from Port A or B wired to the serial shift register's Output Enable, so the loaded selection from up to eight choices is asserted by a single GPIO.
In data Port A, pin 0 is the serial clock and pin 1 is Master Out, Servant In; in Port B pin 0 pin 7 is Master In, Servant Out, and pin 6 is Select. CA2 is the alternative select, and CA1 is available for an /Alert signal for those SPI devices that have an extra pin to send an interrupt or error warning. No actual serial clock mask is needed for the serial clock line, since the serial clock line is being driven entirely from software.
In the words of Crocodile Dundee, "you can live on it, but it tastes like sh*t". There's no guarantee that I have the fastest possible code, but I think it's pretty close, and I get a routine to send/receive a byte at 311 clocks in Mode0, 306 clocks in Mode3, which is an effective maximum bandwidth of around 200Kbps, or just a touch faster than the CX16 I2C bus -- from 14%-33% faster, depending on where it lands in the 150kbps-175kbps range for stable operation.
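For anyone who wants to check the arithmetic, the conversion from clocks per byte to bandwidth is simple enough to put in a few lines of Python. It assumes the 8MHz system clock and back-to-back bytes with no other overhead, so it is a ceiling rather than a prediction; the function name is made up.

```python
CPU_HZ = 8_000_000  # CX16 system clock, as stated in the post


def throughput(clocks_per_byte, cpu_hz=CPU_HZ):
    """Return (bytes per second, kilobits per second) for back-to-back bytes."""
    bytes_per_second = cpu_hz / clocks_per_byte
    kbps = bytes_per_second * 8 / 1000
    return bytes_per_second, kbps


if __name__ == "__main__":
    for label, clocks in (("Mode0", 311), ("Mode3", 306)):
        bps, kbps = throughput(clocks)
        print(f"{label}: {clocks} clocks/byte -> {bps:.0f} B/s, {kbps:.0f} kbps")
```

Both counts come out a little over 200 kbps, which is where the "around 200Kbps" figure comes from.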
Now, suppose that you bought that expansion User Port card. In that case, you have all of a VIA to work with, so it's possible to "build a better bit-banger" than this. The next sketch builds the input byte "in place" by setting Port B pin0 to input, connected to MISO, and the rest of the PortB pins to output, not connected but acting as bit registers. The output byte is shifted left to put the first output bit into the carry flag, and the rest stored in PortB, so the input is shifted into place at the same time that the output is shifted to the carry flag. MOSI, Select and Alert are in PortA along with 5 device selects. The Device Select routine creates complete Bit=0 and Bit=1 states for Port A, including the correct select flag and the clock low, so storing the value of the MOSI bit also starts the clock cycle. The clock pin is PortA pin0, so an "INC PORTA" instruction toggles the serial clock line back up again. The Bit0 and Bit1 states for PortA are loaded into registers A and X at the beginning of the routine and Y is used for the loop index.
This is a bit better, with around 229 or 234 clock cycles per byte, depending on Mode, so between 34KB and 35KB per second, or between 270 and 280 Kbps throughput ... now 50% faster than CX16 I2C. The wiring is every bit as easy as the PortA-only bit banged approach, just connect MISO, MOSI, SCLK and the Selects to the correct pins on the UserPort header.
Next faster is to have the VIA generate the serial clock itself using the handshake lines of Port A. You can put PortA into a mode where it puts a pulse out when Port A is written to. So this approach dedicates PortA to Master Out, Servant In, with Pin7 connected to MOSI, and the bits are shifted out by simply shifting the port, "ASL CIA2_DataA". The shift works because the 65C02 processor does a shift in memory by reading the location, then doing a redundant read of the location while shifting the data, then writing the result. The NMOS 6502, or 6510 in the C64, does a write for the second clock cycle when the shift is actually being performed, so it would trigger the pulse twice for each shift. MISO is connected to Port B Pin7, which is shifted out into the carry flag. Port B also holds two select lines and the serial clock "mask" line in pairs of pins, which answers the question of how you shift the MISO data out without messing up the status of the other pins in the register. Pins 6+5, 4+3 and 2+1 are wired together, with pins 6, 4 and 2 being set up as outputs, and 5, 3 and 1 being set up as inputs. When the data is shifted, the input pins don't care, directly, but their neighbor that has data written to it is reflected in the input pin. Then when the shift instruction picks up the data in the Read cycle of Read-Modify-Write, the shift puts the correct value back into the output pin, and then the wire brings that value back to the input pin.
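The paired-pin trick is easier to believe with a small model. The Python sketch below assumes that reading Port B returns the output register bit for output pins and the pin level for input pins, and that pin 0 is unused and reads as 0; the names are hypothetical and this is a logic model only, not VIA code.

```python
OUTPUT_PINS = 0b01010100  # pins 6, 4 and 2 are outputs; 7, 5, 3, 1 are inputs


def read_port_b(out_reg, miso):
    """What the CPU reads: outputs as stored, inputs from the wired neighbor."""
    value = out_reg & OUTPUT_PINS
    for out_pin in (6, 4, 2):
        level = (out_reg >> out_pin) & 1
        value |= level << (out_pin - 1)  # input pin 5/3/1 mirrors output pin 6/4/2
    value |= (miso & 1) << 7             # MISO arrives on pin 7
    return value


def asl_port_b(out_reg, miso):
    """Model ASL on the port: read, shift left, write back. Returns (new_out, carry)."""
    value = read_port_b(out_reg, miso)
    carry = (value >> 7) & 1             # MISO bit falls into the carry flag
    new_out = (value << 1) & 0xFF        # each mirrored input lands on its output pin
    return new_out, carry


if __name__ == "__main__":
    new_out, carry = asl_port_b(0b01000100, miso=1)
    print(f"outputs after shift: {new_out & OUTPUT_PINS:08b}, carry={carry}")
```

In this model the three output pins come through the shift unchanged for every combination of settings, and the carry always ends up holding the MISO bit.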
Finally, the SPI bus clock mask to support Mode0/Mode2 devices can be provided with a single two-input AND gate: with one input connected to the Mode3 SCLK and the other connected to Pin6 of PortB, it masks the serial clock line. An alternative is to use a NOR gate. That inverts the clock, but running the inverted clock through both inputs of another NOR gate will bring it right way up again, so the hardware for handling the Mode0, Mode1 and Mode2 translations is a single quad two-input NOR gate, with two of the gates still free.
Anyway, after the first bit is managed "by hand" to handle the store to PortA rather than shifting it, and the "unmasking" of the serial clock, the following bits are "ASL PortA : ASL PortB : ROL", which speeds things up a bit. If I "unroll the loop" by putting that three instruction sequence into the code seven times, I get about 195 clocks for Mode3, 179 clocks for Mode0, which at 8MHz is about 325-350 Kbps. So offloading the work of bit banging the serial clock doesn't quite get us up to I2C Fast-mode (400kbps) speeds ... but it's 85%-100% faster than CX16 I2C.
But in any event, all three of these are competing against the I2C routine with one hand tied behind their back, since they aren't actually making use of the serial shift register built into the VIA. So the fourth approach uses the VIA SSR as MISO. It puts the serial shift register into the mode where it is driven by an external serial clock, and uses the PortA handshake line to generate that serial clock. That is the approach that runs Mode0 in 99 clocks for a byte and Mode3 in 89 clocks for a byte, or roughly 640-720Kbps bandwidth, in the neighborhood of four times faster than CX16 I2C.
One quirk of that approach is that SPI works with the Most Significant Bit first, while the VIA SSR is Least Significant Bit first. So part of the fourth approach is a 256 byte table that gives the mirror image of the MISO byte in the shift register ... because it will finish the eight SPI serial cycles with Bit7 where Bit0 is supposed to be and Bit0 where Bit7 is supposed to be.
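The mirror table itself is easy to generate offline and paste into the assembly source as data. A minimal Python sketch, with made-up names, might look like this:

```python
def reverse_bits(value):
    """Return the 8-bit mirror image of value (bit7 <-> bit0, bit6 <-> bit1, ...)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# 256-entry lookup table: MIRROR[raw_byte_from_shift_register] gives the real byte.
MIRROR = bytes(reverse_bits(i) for i in range(256))


if __name__ == "__main__":
    # Emit the table as assembler data, 16 bytes per line.
    for row in range(0, 256, 16):
        print("  .byte " + ",".join(f"${b:02X}" for b in MIRROR[row:row + 16]))
```

On the 65C02 side the fix-up is then a single indexed load from the table, at the cost of one page of memory.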
Now, it can get faster. You can add an external serial shift register, and use the "byte single shot" mode of the built in VIA SSR to put out a byte within 20 clocks of being stored to the VIA serial shift register ... getting throughput as high as 1Mbps (about 125KB/s), like one of the newer higher speed variants of the I2C bus. But what I'd want is a four mode SPI interface ... and I can only see how to make that work for a Mode3/Mode1 interface. I don't know enough about hardware circuitry to build the circuit to pin the mask down, then release the pin but continue to hold the mask down until the native SCLK line goes high and then release it. I know there is some combination of bit latches and flip flops that can do that, but I don't know exactly what combination that would be.
Still, for most of what I would want to talk to through the interface (up to and including cheating and using an RPi W to copy files from my PC to my Commander X16), I think that 80KB/s might well do me. Plus, being able to support all four modes with two two-input NOR gates is really appealing, since the standard "74" glue logic comes with four two-input gates per chip.
SPI_VIA_Version2.asm