Paging of stack

Scott Robison · Post by **Scott Robison** » Mon May 03, 2021 4:01 am

36 minutes ago, Michael Kaiser said:

The reason I asked is because the C128 could do it. I had a C128 30+ years ago and did not have the skill at the time to write a task switching kernel. I was hopeful that the X16 would be able to do this and had hoped to write a task switching kernel that could do this. And yes, I saw both GeckOS and Contiki and their ability to switch tasks by subdividing the stack. I just found that limiting. I may still do this by subdividing the stack.

You are correct, not having the option is more limiting than having the option. But the option won't be there, so manual copying of the stack / zero page will be required.

I kind of had to do something similar (though much less complex) for my BASIC PREPROCESSOR so that I could recursively call into BASIC to use the crunch routine from assembly. When I SYS $0401 to my routine, I save all the zero page and $200-$3FF values that BASIC depends on, JSR into BASIC (which is already running my BASIC program), then restore them. In my case I think that is only about six bytes or so.

Roman K · Post by **Roman K** » Mon May 03, 2021 9:47 am

Is it possible to implement additional mapping via an extension card? Is there any documentation on how banking is implemented at all? And how can memory values be overridden? For example, a device listens to some memory addresses (I/O-to-memory mapping, as I understand it) and provides its own memory instead. That's the way it works for devices, but I'm just curious: what happens with the original memory then?

ZeroByte · Post by **ZeroByte** » Mon May 03, 2021 3:34 pm

I wouldn't think so - The thing about DMA is that either the DMA device is driving, or the CPU is driving. I don't think there's any way for an expansion slot to take over the bus and override the glue logic, which is what would need to happen for an expansion device to supplant a system peripheral. A device in the expansion slot can listen to the entire address range - e.g. a debugger interface card. But I don't think it's going to have the power to say, "No no. I don't think VERA should respond to $9f24 - that's me, and I'm going to do X-Y-Z instead." The glue logic is going to be sending CE signals to whatever chips are supposed to receive them for any given address/range. E.g. if the bus has $9f40, then the glue logic will be sending CE to the YM2151, regardless of whether that was asserted by the CPU or a DMA device.

Roman K · Post by **Roman K** » Mon May 03, 2021 3:51 pm

@ZeroByte I mean overriding the memory, not another device. Say my device with RAM on it listens to the address bus and responds to the CPU when some memory address is requested. If there is already a memory chip responsible for that address, how is that conflict resolved? That should happen on real devices. Or is there a dedicated IO range that is not handled by RAM and can be used by devices?

ZeroByte · Post by **ZeroByte** » Mon May 03, 2021 4:03 pm

If you want to override memory, the best you could do is to monitor writes and then do a DMA and come in behind the CPU and overwrite an address with something else. The glue logic is going to activate the RAM IC for address X whenever that address is asserted on the bus. You can't make a device which assumes the role of the bus devices and supplant the glue logic - remember that even RAM is just a device on the bus.

BruceMcF · Post by **BruceMcF** » Tue May 04, 2021 5:57 am

On 5/3/2021 at 11:51 PM, Roman K said:

@ZeroByte I mean overriding the memory, not another device. Say my device with RAM on it listens to the address bus and responds to the CPU when some memory address is requested. If there is already a memory chip responsible for that address, how is that conflict resolved? That should happen on real devices. Or is there a dedicated IO range that is not handled by RAM and can be used by devices?

There is a dedicated IO range that can be used by devices: the sets of 32-byte control register addresses for I/O in the $9F00-$9FFF page, with three sets used by the system and five available on the expansion slots. Beyond that, DMA ("Direct Memory Access") would be the way to go: just take over the bus and write the data directly to the desired RAM location. At the upper limit, where only one side of the transfer is on the CX16 bus, that can proceed at one byte per cycle, so a "binary page" of 256 bytes can be moved in 256 cycles, whereas a general purpose (zp),Y copy is:

COPY: LDY #0

COPY1: LDA (SRC),Y : STA (DEST),Y : INY : BNE COPY1

COPY2: RTS

Where if that is page-aligned, that is 16 cycles per byte moved (5 for the load, 6 for the store, 2 for INY, 3 for the taken branch), plus overhead.
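As a rough check on the per-byte cost, the loop can be tallied from the standard 65C02 instruction timings. This is only a sketch: it assumes both buffers are page-aligned (so the indirect load never pays a page-crossing penalty) and that the branch does not cross a page. The names `CYCLES` and `copy_page_cycles` are made up for the illustration.

```python
# Standard 65C02 timings for the instructions used by COPY.
# Assumption: page-aligned buffers, so no page-crossing penalties apply.
CYCLES = {
    "LDY #imm": 2,
    "LDA (zp),Y": 5,
    "STA (zp),Y": 6,
    "INY": 2,
    "BNE taken": 3,
    "BNE not taken": 2,
    "RTS": 6,
}

def copy_page_cycles():
    """Total cycles for one call to COPY, which always moves 256 bytes."""
    body = CYCLES["LDA (zp),Y"] + CYCLES["STA (zp),Y"] + CYCLES["INY"]
    looping = 255 * (body + CYCLES["BNE taken"])   # first 255 passes branch back
    final = body + CYCLES["BNE not taken"]         # last pass falls through
    return CYCLES["LDY #imm"] + looping + final + CYCLES["RTS"]

per_byte = (CYCLES["LDA (zp),Y"] + CYCLES["STA (zp),Y"]
            + CYCLES["INY"] + CYCLES["BNE taken"])
total = copy_page_cycles()
print(per_byte, total, round(total / 256, 2))
```

By this count the CPU copy costs roughly sixteen times as many cycles per page as the ideal one-byte-per-cycle DMA transfer.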

Now, a page at a time is a lot of time for interrupts to be suspended, so a more general purpose DMA board might be oriented around 16-byte, 32-byte or 64-byte chunks, depending on how long you figure you can leave interrupts suspended without messing with performance. If the DMA chunk reference autoincrements, and setting the lower byte of the main bus target or destination chunk register triggers the DMA, you might have a loop of:

PASTE: ; A=chunk lo, Y=target-hi, as chunk reference, X = #chunks, 0-base, DMA source address is already set-up

PASTE1: STY dmatrg+1 : STA dmatrg : DEX : BEQ PASTE3

PASTE2: INC dmatrg : DEX : BNE PASTE2

PASTE3: RTS

That's an inner loop of 1.2-1.7 clocks per byte moved, plus overhead, for chunks of 16-64 bytes. Ideally the auto-increment system for the DMA RAM would be the same auto-increment system as for Vera, both to reduce the amount of new things that need to be learned, and also to integrate with Vera.
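The 1.2-1.7 range can be reproduced with a little arithmetic. The sketch below assumes (this is not stated in the thread) that the hypothetical DMA board halts the CPU and moves one byte per cycle, and that the inner loop costs INC abs (6) + DEX (2) + BNE taken (3) cycles per chunk.

```python
# Assumption: DMA moves one byte per cycle while the CPU waits, and each
# chunk is triggered by the PASTE2 loop: INC abs (6) + DEX (2) + BNE taken (3).
LOOP_OVERHEAD = 6 + 2 + 3  # CPU cycles spent per chunk

def clocks_per_byte(chunk_bytes):
    """Average clocks per byte moved for a given DMA chunk size."""
    return (chunk_bytes + LOOP_OVERHEAD) / chunk_bytes

rates = {n: round(clocks_per_byte(n), 2) for n in (16, 32, 64)}
print(rates)
```

Larger chunks amortize the fixed loop overhead better, at the price of holding off interrupts for longer.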

For Vera, and conceivably also for other I/O, you'd also have a fixed IO page target address function:

VPASTE: ; A=IOaddr, X = #chunks, 0-base, I/O target settings, DMA source address is already set-up

VPASTE1: STA dmaiotrg : DEX : BNE VPASTE1

VPASTE2: RTS

Michael Kaiser · Post by **Michael Kaiser** » Mon May 24, 2021 1:12 pm

Ok. So copy zero page and stack to a bank for task 1 and store registers, then copy stack and zero page from a bank for task 2 and retrieve registers. Takes about 1.5 milliseconds at 8 MHz. That might actually be usable.

BruceMcF · Post by **BruceMcF** » Mon May 24, 2021 1:44 pm

30 minutes ago, Michael Kaiser said:

Ok. So copy zero page and stack to a bank for task 1 and store registers, then copy stack and zero page from a bank for task 2 and retrieve registers. Takes about 1.5 milliseconds at 8 MHz. That might actually be usable.

And depending on the task, you may not need to copy it all. Move the stack down to the bottom quarter of the stack page and reserve a 64-byte section of zero page, and you only need to copy two chunks of 64 bytes up and two chunks of 64 bytes down.
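To put a rough number on the saving, here is a back-of-the-envelope estimate of a CPU-only task switch. It assumes about 16 cycles per byte for a generic (zp),Y copy loop and an 8 MHz clock, and it ignores loop setup and register save/restore; `switch_time_ms` is a made-up helper, not part of any kernel.

```python
CPU_HZ = 8_000_000  # assumed 8 MHz system clock

def switch_time_ms(bytes_per_task, cycles_per_byte=16):
    """Save one task's bytes and restore another's; return milliseconds.

    cycles_per_byte defaults to an assumed cost for a (zp),Y copy loop.
    """
    cycles = 2 * bytes_per_task * cycles_per_byte  # copy out, then copy in
    return cycles * 1000 / CPU_HZ

full = switch_time_ms(512)     # whole zero page plus whole stack page
reduced = switch_time_ms(128)  # 64 bytes of zero page plus 64 bytes of stack
print(full, reduced)
```

Under those assumptions the reduced layout cuts the copy time to a quarter, from about 2 ms to about 0.5 ms per switch.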

Serentty · Post by **Serentty** » Sun Jun 27, 2021 6:41 pm

Just thinking about this, I think I might have come up with a method for swapping out the zero page that's even faster than just getting the 65C02 to copy it at 8 MHz—store the other zero page banks in VRAM, and make use of the VERA's autoincrementing ports. I'm not sure of the exact cycle counts, but I would imagine you would save some cycles by only having to do one indexed access instead of two.
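A quick tally of the per-byte cost suggests what the saving might look like. This is only a sketch from standard 65C02 timings: it assumes the save loop reads zero page with `LDA zp,X`, that the VERA data port is written with a plain absolute store once auto-increment is configured, and it leaves out the cost of setting up the VERA address registers.

```python
# Per-byte cost of saving zero page, from standard 65C02 timings.
# Assumption: the VERA data port takes a plain absolute store, with the
# address auto-increment already configured before the loop starts.
to_ram = {"LDA zp,X": 4, "STA abs,X": 5, "INX": 2, "BNE taken": 3}
to_vera = {"LDA zp,X": 4, "STA abs": 4, "INX": 2, "BNE taken": 3}

ram_cost = sum(to_ram.values())
vera_cost = sum(to_vera.values())
saved_per_page = (ram_cost - vera_cost) * 256
print(ram_cost, vera_cost, saved_per_page)
```

By this count the saving is about one cycle per byte, so the approach helps, but modestly.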

On 4/30/2021 at 12:02 AM, kelli217 said:

Thought was seriously given to using a processor that supports the 65C02 instruction set but also already has relocatable stack and direct pages. However, there was still a problem; this processor has part of the address bus and data bus sharing the same lines via multiplexing, and the external demux logic was determined to be too much to deal with.

And that processor is the 65C816.

I thought the 65C816 was still a consideration (just not making use of its banking features), but they weren't going to test it until the design is otherwise finalized, which seems like an oddly suspenseful way to do it compared to just occasionally checking if the 65C816 works in the current prototypes.

BruceMcF · Post by **BruceMcF** » Mon Jun 28, 2021 5:18 am

10 hours ago, Serentty said:

Just thinking about this, I think I might have come up with a method for swapping out the zero page that's even faster than just getting the 65C02 to copy it at 8 MHz—store the other zero page banks in VRAM, and make use of the VERA's autoincrementing ports. I'm not sure of the exact cycle counts, but I would imagine you would save some cycles by only having to do one indexed access instead of two.

I thought the 65C816 was still a consideration (just not making use of its banking features), but they weren't going to test it until the design is otherwise finalized, which seems like an oddly suspenseful way to do it compared to just occasionally checking if the 65C816 works in the current prototypes.

Since the bus timings are tight, it really would be premature to check until they have a board working with the hardware they are going to use at 8MHz.

But I also would not be surprised if that eventually gets pushed out to "if you want that, get a bus mastering 65816 expansion card".