Noodling with Assembly code on the X16 (using cl65)

rje · Post by **rje** » Fri Jan 07, 2022 12:06 am

I saw Bruce's SWEETER16, so I decided to sit down and write some assembly on the X16. First, I want to use the X16's pseudo-registers in zero page... and have some scratch space right above there, too.

;

; kungpao16.s

; seeing how far I can go to create a SWEET16 variant.

;

R0L = $02 ; accumulator low

R0H = $03 ; accumulator high

R14H = $1f ; status high

R15L = $20 ; PC low

R15H = $21 ; PC high

;

; I'll set up a 16-bit short stack, just for fun.

; I've got from $22 to $7f

;

XRL = $22 ; extended 16 bit stack, low bytes

XRH = $52 ; extended 16 bit stack, high bytes
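As a sanity check on the layout above, here is a small Python sketch (not part of kungpao16.s). It assumes the X16 convention that R0-R15 sit at $02-$21, two bytes each with the low byte first. The names `reg_lo`, `reg_hi`, `low_slots` and `high_slots` are made up for illustration.

```python
# Hypothetical model of the zero-page layout; not part of the assembly source.
REG_BASE = 0x02          # R0L lives at $02, each register takes two bytes

def reg_lo(n):
    """Zero-page address of the low byte of pseudo-register Rn."""
    return REG_BASE + 2 * n

def reg_hi(n):
    """Zero-page address of the high byte of pseudo-register Rn."""
    return reg_lo(n) + 1

XRL = 0x22               # start of the low-byte stack area
XRH = 0x52               # start of the high-byte stack area
SCRATCH_END = 0x7F       # last free zero-page byte mentioned in the post

low_slots = XRH - XRL                  # bytes available for low bytes
high_slots = SCRATCH_END - XRH + 1     # bytes available for high bytes

print(hex(reg_hi(14)), hex(reg_lo(15)), hex(reg_hi(15)))
print(low_slots, high_slots)
```

Under these assumptions the high-byte area is the shorter of the two, so it is what limits the stack depth.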

Next, I want to use the full CPU.

.pc02 ; enables 65C02

I'm using "golden RAM". $400 for my code, and $700 for data. So I'll put some data in place.

;

; Set up some test data.

;

LDX #$11 ; $0700+ = $11 $41 $15 $00

STX $0700

LDX #$41

STX $0701

LDX #$15

STX $0702

LDX #$0

STX $0703

LDX #$7 ; R15 (PC) = $0700

STX R15H

LDX #$0 ; ... and now X is zero (conveniently)

STX R15L

Now I can write a loop that checks each byte in the data stream, and quits when it finds a zero.

;

; The start address is in R15.

;

; Fetch the instruction (zp fetch!)

;

FETCH: LDA (R15L)

BEQ DONE ; 00 = PROGRAM END

TAY ; stash it in Y

AND #$0F

STA XRH,X ; put the low nybble into RxH

TYA ; bring it back from Y

AND #$F0

STA XRL,X ; put the upper nybble into RxL

INX

;

; Increment R15 (the PC)

;

NEXT: INC R15L

BNE FETCH

INC R15H

BRA FETCH

DONE: RTS

I use SYS $400 to run it, and here's what it does: it reads the bytes starting at $700 (and stops when it finds the zero at $703). It splits each byte into upper and lower nybbles, putting them into zero page at $22,X and $52,X respectively. The upper nybble is only masked, not shifted down, so the result is:

$10, $40, and $10 starting at $0022, and $01, $01, $05 starting at $0052.
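The loop can be modelled in a few lines of Python to confirm which values land where. This is only a sketch of the logic, not a 6502 emulator; `split_nybbles` and the dictionary standing in for memory are names invented here.

```python
def split_nybbles(memory, pc):
    """Model of the FETCH/NEXT loop: walk bytes from pc until a zero byte."""
    xrl = []  # masked upper nybbles, as stored by STA XRL,X (from $22 up)
    xrh = []  # lower nybbles, as stored by STA XRH,X (from $52 up)
    while memory[pc] != 0:            # BEQ DONE on a zero byte
        byte = memory[pc]
        xrh.append(byte & 0x0F)       # AND #$0F
        xrl.append(byte & 0xF0)       # AND #$F0, masked but not shifted
        pc = (pc + 1) & 0xFFFF        # INC R15L, then INC R15H on wrap
    return xrl, xrh

# The same test data the listing stores at $0700.
memory = {0x0700: 0x11, 0x0701: 0x41, 0x0702: 0x15, 0x0703: 0x00}
xrl, xrh = split_nybbles(memory, 0x0700)
print([hex(v) for v in xrl])
print([hex(v) for v in xrh])
```

Because the upper nybble is masked rather than shifted, $41 contributes $40 to the XRL area, not $04.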

rje · Post by **rje** » Fri Jan 07, 2022 2:24 am

Ideally, the opcodes would be organized in some rational way. Instead, Woz just has them however he likes and uses a table.

But I was taught (if that's the word) that complex instruction codes ideally are organized rationally for decoding, rather than jumptabling.

On the gripping hand, though, Woz' jump table is only 64 bytes? That's pretty small. Maybe I can decode 32 instructions in less than 64 bytes (and maybe not!), but I certainly can't dispatch fast with decoding logic.

Ed Minchau · Post by **Ed Minchau** » Fri Jan 07, 2022 4:52 am

One of the best virtual machine dispatch systems I've seen is the one in Acheron by WhiteFlame. Acheron is another 16-bit virtual machine like Sweet16; it takes only 11 cycles for dispatch.

https://www.white-flame.com/acheron/

rje · Post by **rje** » Fri Jan 07, 2022 3:48 pm

Isn't Acheron an entire operating system? Oh, wait, that's something else I'm remembering.

But Acheron is ... well suffice to say it doesn't look like it's 300 bytes.

Ed Minchau · Post by **Ed Minchau** » Fri Jan 07, 2022 4:51 pm

Yeah, more like 2kb, but unlike sweet16 you can have up to 128 VM commands.

rje · Post by **rje** » Fri Jan 07, 2022 7:12 pm

Hmmmmm possibilities.

So I was thinking that the compact Woz-tiny version still has utility with the KERNAL or BASIC on the X16. Assuming there are enough* 16-bit ops in those ROMs in non-speed-critical places.

* More than 300 bytes' worth... and where the space savings makes other things possible.

But Acheron... I mean that could fit handily in a ROM bank, with room for whatever else would go there.

Or... assume even more duties of the KERNAL, but I am deliberately ignorant of the KERNAL so I don't know.

rje · Post by **rje** » Fri Jan 07, 2022 7:22 pm

I've never really thought through what is required to do a 16 bit operation.

Suppose you had two 16 bit values in zero page, say at $02-$03 and at $04-$05, and you wanted to add them together. How would I try to do this?

I'd first add the low bytes together.

ADD_LO:

LDA $02

ADC $04

STA $02 ; I guess we put the result back into $02. I guess.

BCC ADD_HI

; handle carry set

CLC

INC $03 ; add carry to hi byte of first number

; what if THAT causes the carry to get set? I dunno.

ADD_HI:

LDA $03

ADC $05

; I don't know how I should handle carry in this case.

STA $03 ; back into $03, I guess.

DONE:

RTS

As far as I can tell, this'll add the lower two bytes, add the carry to one of the upper bytes, then add the two upper bytes together.

It SEEMS to me like this would add two 16 bit numbers.

It also seems to be quite expensive... I don't know, but it looks like it's AT LEAST 19 bytes long. I think if you had to do this beyond zero page, it would be like 26 bytes or more?

If you had 16 different places that used somewhat leisurely 16 bit adds, subtractions, and compares in your 16K ROM, then something like SWEET16 is worth it.

Ed Minchau · Post by **Ed Minchau** » Fri Jan 07, 2022 7:35 pm

Simpler than that.

LDA $02

CLC

ADC $04

STA $02

LDA $03

ADC $05

STA $03

The carry bit from the low bytes is set if $02+$04 > $FF, and the second ADC adds it into the high byte automatically.
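To see the carry travel from the low byte to the high byte, here is a Python model of that sequence. It is a sketch that assumes binary (non-decimal) mode; `adc` and `add16` are names invented for the example, and the dictionary stands in for zero page.

```python
def adc(a, operand, carry):
    """8-bit ADC in binary mode: returns (result, carry_out)."""
    total = a + operand + carry
    return total & 0xFF, 1 if total > 0xFF else 0

def add16(zp):
    """Add the 16-bit value at $04/$05 into $02/$03, low byte first."""
    lo, carry = adc(zp[0x02], zp[0x04], 0)       # CLC, then ADC $04
    zp[0x02] = lo
    hi, carry = adc(zp[0x03], zp[0x05], carry)   # ADC $05 picks up the carry
    zp[0x03] = hi
    return carry                                  # carry out of the high byte

zp = {0x02: 0xFF, 0x03: 0x12, 0x04: 0x01, 0x05: 0x00}   # $12FF + $0001
carry_out = add16(zp)
print(hex(zp[0x03] << 8 | zp[0x02]), carry_out)
```

Here the low bytes overflow ($FF + $01), so the high byte becomes $13 without any branch or INC.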

rje · Post by **rje** » Fri Jan 07, 2022 7:39 pm

Ahhh because ADC is ADD WITH CARRY. 13 bytes in ZP, 19 otherwise.

Thank you!

BruceMcF · Post by **BruceMcF** » Sun Jan 09, 2022 12:01 am

On 1/6/2022 at 9:24 PM, rje said:

Ideally, the opcodes would be organized in some rational way. Instead, Woz just has them however he likes and uses a table.

But I was taught (if that's the word) that complex instruction codes ideally are organized rationally for decoding, rather than jumptabling.

On the gripping hand, though, Woz' jump table is only 64 bytes? That's pretty small. Maybe I can decode 32 instructions in less than 64 bytes (and maybe not!), but I certainly can't dispatch fast with decoding logic.

It's not all at random, though it's definitely not like a microcoded processor instruction set ... more like the 6502 which feels free to take an opcode that doesn't make sense for one type of operation and use it for another. That is:

aaa d rrrr, address-mode, direction, register

rrrr is the 16bit pseudo register, R0-R15

d=0: operand to ACC, d=1, ACC to operand

aaa is the operand address mode

aaa=000, immediate (followed by 16bit immediate value)

aaa=001, register direct

aaa=010, register indirect post-increment (lower 8bits, upper 8bits cleared)

aaa=011, register double indirect post-increment

aaa=100, pre-decrement register indirect

aaa=110, pre-decrement register double indirect

... but "0000 rrrr" is a nonsense action (eg, you cannot store the accumulator to the number 768), so instead "rrrr" is a non-register operation. With all of the indirect loads and stores being post-increment, you only need one direction of pre-decrement to make a stack. HOWEVER, the single byte pre-decrement needs load AND store, so together they can do a move of a block of data from "back to front", if source is below destination and they overlap. So the single byte "POP" has both directions but the double byte one (to allow 16bit value stacks) only needs one direction.

Then there is arithmetic:

aaa s rrrr, arithmetic-op, sign, register

s= sign, 0=+ (plus), 1=- (minus)

aaa = 101, sum, ACC = ACC +/- register, set branch carry, zero, negative conditions

aaa = 110, sum value = ACC +/- register, set branch carry, zero, negative conditions, discard value

aaa = 111, inc/decrement, register = register +/- 1

Of course, 6 load/store operations and 3 arithmetic operations do not fit into 3bits, except the comparison operation only needs to subtract, and double byte pre-decrement only needs to work in one direction, so that lets it fit together like a jigsaw puzzle.

Edit: Note that while putting the register in the bottom nybble and the instruction in the top one is for functional reasons, there is ONE instruction that is almost implied by the design, which is the CPR Rn, since when beginning execution, the four operation bits end up in bits 1-4 of the Y register (for the instruction table look-up), and CPR uses that to give the index of the target for the subtraction, which the CPR instruction places in R13 rather than R0 (the accumulator). So the CPR opcode has to be $Dn, unless the CPR result register is relocated.

And then that implies that the two-byte POP instruction is at $Cn, by the "jigsaw puzzle" logic above.
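The bit layout described above can be sketched as a small Python decoder. This is only a reading of the description in this post, not Woz's implementation; `decode`, `describe` and the wording of the returned strings are invented for illustration.

```python
def decode(opcode):
    """Split an opcode byte into (aaa, d_or_s, rrrr) per the layout above."""
    aaa = (opcode >> 5) & 0x07    # top three bits: address mode / arithmetic op
    bit = (opcode >> 4) & 0x01    # direction (load/store) or sign (arithmetic)
    rrrr = opcode & 0x0F          # pseudo-register number
    return aaa, bit, rrrr

MODES = {
    0b001: "register direct",
    0b010: "register indirect post-increment",
    0b011: "register double indirect post-increment",
    0b100: "pre-decrement register indirect",
}

def describe(opcode):
    aaa, bit, rrrr = decode(opcode)
    if aaa == 0b000:
        if bit == 0:
            return "non-register operation %d" % rrrr
        return "immediate (16-bit value follows), R%d" % rrrr
    if aaa in MODES:
        direction = "ACC to operand" if bit else "operand to ACC"
        return "%s, %s, R%d" % (MODES[aaa], direction, rrrr)
    if aaa == 0b101:
        return "ACC = ACC %s R%d" % ("-" if bit else "+", rrrr)
    if aaa == 0b110:
        # The jigsaw slot: one pre-decrement direction plus subtract-only compare.
        if bit:
            return "compare: ACC - R%d, conditions only" % rrrr
        return "pre-decrement register double indirect, R%d" % rrrr
    return "R%d = R%d %s 1" % (rrrr, rrrr, "-" if bit else "+")

print(describe(0xD3))
print(describe(0xC3))
```

Decoding $D3 and $C3 shows both landing on aaa=110, with the fourth bit choosing between the compare and the two-byte pre-decrement operation.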

Since I was attempting a re-implementation, I focused on the description of the functioning of the operations rather than Woz's implementation.

However, even with a different dispatch model, if trying to squeeze object size in a "Sweet 16 replacement", rather than optimizing for speed, I could imagine having a single indirect load and a single indirect store routine, which works out from the bits of the opcode and the status of the carry flag whether it is pre-decrement or post-increment and whether it is a single or double byte operation, covering 7 operations in two routines. Direct register moves could be handled by putting source in Y and destination in X, at the cost of using absolute rather than direct addressing for the Y-indexed operation, giving one routine for the two direct ones. One could imagine the immediate register load being run by the two-byte accumulator load, setting the indirect source register to R15, the PC register, and using Y-indexed store, so the immediate load is taken over by the single indirect load routine as well.

Then at the cost of three more zero page bytes ... two more bytes in a dedicated "register 17" initialized to $0001, and one set to either $80 or $00 based on whether adding or subtracting, setting up the correct target and operand index in X and Y would allow all five arithmetic operations to be done in a single routine. If that was done by shifting the instruction one bit to the left and using the carry flag and sign flag to split the code set into quarters, you might restrict the jump table to the $0n instructions, making it only 26-32 bytes long.
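One way to read "shifting the instruction one bit to the left" is that after an ASL the carry flag holds the opcode's old bit 7 and the sign flag holds its old bit 6, so the two flags together pick one of four groups of opcodes. The Python below is a sketch of that interpretation only; `asl` and `quarter` are hypothetical names, and the numbering of the quarters is an arbitrary choice.

```python
def asl(value):
    """6502-style ASL on an 8-bit value: returns (result, carry, negative)."""
    carry = (value >> 7) & 1          # old bit 7 falls into the carry flag
    result = (value << 1) & 0xFF
    negative = (result >> 7) & 1      # new bit 7 is the old bit 6
    return result, carry, negative

def quarter(opcode):
    """Which quarter of the opcode space the two flags select (0 to 3)."""
    _, carry, negative = asl(opcode)
    return carry * 2 + negative

for op in (0x05, 0x45, 0x85, 0xC5):
    print(hex(op), quarter(op))
```

In this reading, quarter 0 covers $0n-$3n, quarter 1 covers $4n-$7n, quarter 2 covers $8n-$Bn and quarter 3 covers $Cn-$Fn.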