Custom BASIC interpreter for X16

Scott Robison · Post by **Scott Robison** » Sat Sep 11, 2021 2:23 am

2 hours ago, TomXP411 said:

This is not far off from what I suggested a while back. Tokenize not just the keywords (PRINT, GOTO, etc) but also variable names and numeric literals. Assuming 01-12 are available as token codes, we could use:

01 - 8 bit byte (an integer literal between 0 and 255)

02 - 16 bit integer (any integer literal between -32768 and 65535)

03 - 40 bit float (any numeric literal with a decimal point, eg: 3.14 or 1.0)

04 - byte variable (# sigil or DIM x AS BYTE)

05 - integer variable (% sigil or DIM x AS INT)

06 - float variable (! sigil or DIM x AS FLOAT)

07 - string variable ($ sigil or DIM x AS STRING)

08 - label

09 - start of subroutine

10 - start of function

11 - end of function or subroutine

PRINT $1234 gets changed to

94 02 34 12

PRINT A$ might get converted to

94 07 01

and A$="HELLO" becomes

07 01 B2 "HELLO"
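A rough sketch of how the literal tokens above might be emitted, in Python for readability. The token numbers 01 and 02 come from the list above; the function name, the low-byte-first order for 16-bit values (suggested by the `34 12` example), and the handling of `$` as a hex prefix are assumptions. Float literals (token 03) are left out because the 40-bit layout is not spelled out here.

```python
import struct

TOK_BYTE = 0x01    # 8-bit literal, 0..255
TOK_INT16 = 0x02   # 16-bit literal, -32768..65535

def encode_literal(text):
    """Return the token bytes for one integer literal (hypothetical helper)."""
    # Assumption: a leading '$' marks a hexadecimal literal.
    value = int(text[1:], 16) if text.startswith("$") else int(text)
    if 0 <= value <= 255:
        return bytes([TOK_BYTE, value])
    if -32768 <= value <= 65535:
        # Assumption: low byte first, as is usual on the 6502.
        return bytes([TOK_INT16]) + struct.pack("<H", value & 0xFFFF)
    raise ValueError("literal does not fit in 16 bits")
```

With these assumptions, `encode_literal("$1234")` yields the bytes `02 34 12`, and any literal in 0..255 takes two bytes instead of three.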

You could also change types on the fly by referencing a variable with a different sigil. So

X = $1234

could be referenced with the byte sigil and would act like a 2 byte array:

PRINT X#

34

PRINT X#(1)

12

(Remember that arrays are zero-based)
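The sigil trick can be modelled as two different views over the same heap bytes. This is only an illustration: the heap size, the address `X_ADDR`, and the helper names are made up, and it assumes X is stored as a 16-bit little-endian integer.

```python
import struct

heap = bytearray(16)   # hypothetical variable heap
X_ADDR = 4             # hypothetical address of X on that heap

# X = $1234, stored low byte first (assumption)
struct.pack_into("<H", heap, X_ADDR, 0x1234)

def read_byte(addr, index=0):
    """X#(index): view the storage at addr as a zero-based byte array."""
    return heap[addr + index]

def read_int(addr, index=0):
    """X%(index): view the storage at addr as an array of 16-bit integers."""
    return struct.unpack_from("<H", heap, addr + 2 * index)[0]

low = read_byte(X_ADDR)       # X#    gives $34
high = read_byte(X_ADDR, 1)   # X#(1) gives $12
```

Nothing is converted when the sigil changes; only the element width used to index the storage differs.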

This implies that arrays are nothing special: array variables would simply reserve more than 1 space in the heap, so:

DIM NAMES$(25) AS STRING would reserve 50 bytes on the heap, and if you recalled NAMES%(x), you would get back a 2-byte value, which is actually the pointer to the string.

Where this comes in useful is creating large, arbitrary data arrays. For example, rooms in an adventure game.

DIM ROOM#(1024) creates a 1K chunk of memory that can be used for any purpose. You could then load rooms in on-demand from disk, every time the player moves from one room to another.

Labels and subroutine names would simply be more entries on the variable table.

The variable table itself is super simple:

01-02: data/code address

03: length of variable name

04-?? text of variable name

There are no types on the variable table, because the type is determined at runtime based on the token code. The token code is determined at compile time based on the sigil or a DEF <BYTE | INT | FLOAT> statement.
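For illustration, here is one way the variable table described above could be packed and searched. The field order (2-byte address, 1-byte name length, name text) follows the layout given; the little-endian address, the zero-based entry index, and the function names are assumptions.

```python
import struct

def pack_entry(address, name):
    """Build one entry: 2-byte address, 1-byte name length, then the name."""
    raw = name.encode("ascii")
    return struct.pack("<HB", address, len(raw)) + raw

def find_entry(table, name):
    """Walk the packed table; return (index, address), or None if absent."""
    offset, index = 0, 0
    while offset < len(table):
        address, length = struct.unpack_from("<HB", table, offset)
        entry_name = table[offset + 3 : offset + 3 + length].decode("ascii")
        if entry_name == name:
            return index, address
        offset += 3 + length   # skip the header and the name text
        index += 1
    return None

# Two hypothetical entries; the addresses are invented for the example.
table = pack_entry(0x0900, "A") + pack_entry(0x0902, "NAMES")
```

The lookup only has to happen when a line is tokenized. At run time the token stream already carries the entry index, so the name text is needed again only for LIST.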

There are a ton of advantages to this system. Right now, the BASIC routines all have to parse their own data. Doing it this way means the data is pre-parsed. The routines simply read the parameters directly out of the program stream.

The actual program text can be more compact, too. You don't store spaces. You don't store commas in parameter sequences. Those are just discarded and re-created if the program is detokenized (listed).

Yes, this is very similar to what I did for PCBoard, except I was not as concerned as maybe I should have been about token size. I created two byte tokens (it was for 16 bit DOS systems, after all). Every statement the language supported was a two byte signed positive number (no zero). Every function or operator that the expression engine supported was a two byte signed negative number. Variables were just an index into an array of dynamically typed values. Zero was used to mark the end of an expression where one was valid. So a PRINT statement that looked something like this:

PRINT a, b*2, c+3

Would be tokenized as:

$0001 $0003 $0001 $0000 $0002 $0003 $FFFF $0000 $0004 $0005 $FFFE $0000

Token meanings are:

PRINT statement

Count of expressions to evaluate for PRINT

Index of variable a in vartab (pushed on eval stack)

End of first expression (pop result from eval stack and print it)

Index of variable b in vartab (pushed on eval stack)

Index of constant 2 in vartab (pushed on eval stack; I stored constants as variables with impossible to generate names for runtime simplicity and stack evaluation consistency)

Multiplication operator (pop two values, multiply them, push result)

End of second expression (pop result from eval stack and print it)

Index of variable c in vartab (pushed on eval stack)

Index of constant 3 in vartab (pushed on eval stack)

Addition operator (pop two values, add them, push result)

End of third expression (pop result from eval stack and print it)

The values of the tokens are not exact, but they illustrate the idea. My tokenized / pcode scripts could use an entire 64 KB for tokens, or 32,768 tokens max, plus extra memory for the vartab. Each token was stored at a given offset in the token array, so GOTO/JMP (whether used explicitly or as part of a structured construct) would just update the PC to the new token index.
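A small Python model of that evaluation loop, using the illustrative token values from the PRINT example. It is a sketch of the idea, not the PCBoard code: the function names, the dictionary used as the vartab, and the sample variable values are all invented.

```python
PRINT = 1                  # statement tokens are positive
OP_MUL, OP_ADD = -1, -2    # operator tokens are negative ($FFFF, $FFFE)
END_EXPR = 0               # zero ends an expression

def to_signed(word):
    """Interpret a 16-bit word as a signed two's-complement value."""
    return word - 0x10000 if word & 0x8000 else word

def run_print(tokens, pc, vartab, out):
    """Execute one PRINT starting at tokens[pc]; return the next pc."""
    assert to_signed(tokens[pc]) == PRINT
    count = tokens[pc + 1]          # number of expressions to evaluate
    pc += 2
    for _ in range(count):
        stack = []
        while (tok := to_signed(tokens[pc])) != END_EXPR:
            if tok > 0:             # variable or constant index
                stack.append(vartab[tok])
            elif tok in (OP_MUL, OP_ADD):
                rhs, lhs = stack.pop(), stack.pop()
                stack.append(lhs * rhs if tok == OP_MUL else lhs + rhs)
            else:
                raise ValueError("unknown operator token")
            pc += 1
        out.append(stack.pop())     # "print" the expression result
        pc += 1                     # step over END_EXPR
    return pc

tokens = [0x0001, 0x0003, 0x0001, 0x0000, 0x0002, 0x0003,
          0xFFFF, 0x0000, 0x0004, 0x0005, 0xFFFE, 0x0000]
# Sample values: a=10, b=4, constant 2, c=7, constant 3
vartab = {1: 10, 2: 4, 3: 2, 4: 7, 5: 3}
out = []
next_pc = run_print(tokens, 0, vartab, out)   # out becomes [10, 8, 10]
```

Because constants live in the vartab like any other value, the loop needs only one "push" case, which is the runtime simplicity mentioned above.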

Obviously, on an 8 bit system, you'd want more compact tokens. And for an interactive interpreter, you'd have to store them more like you suggested so that as lines were added, edited, and deleted, the program would still be runnable. My language had a script compiler that translated from a text file to the tokenized form which meant I didn't have to worry about LISTing it back out later (though some enterprising third party did reverse engineer it and created a decompiler for it which was able to do a great job, as I did not attempt to write an optimizing compiler). I remember the README file that came with it being critical of how simplistic it was and how it should have been done in a different way to generate machine code, but that wasn't the point. We wanted to let people customize the system. And later when we ported PCBoard to OS/2, most scripts continued to run unmodified because we didn't tie them to a particular machine representation. The only exceptions were if people used the PEEK & POKE and similar statements because ... I was able to do it, so why not?

https://www.99-bottles-of-beer.net/language-ppl-562.html gives a longer if not completely useful example.

Edit: Note that I started off with one byte tokens. The initial ambition was to create a simple scripting language. As things happen at times, as people saw it able to do more and more, a lot of "what if we added..." features crept in. Enough to overflow a byte. A more complex encoding would have been possible, but I was still a fairly young developer (I was about 25 when I did this project) and either didn't consider it and/or didn't have the time to rework the portions I had already written.

TomXP411 · Post by **TomXP411** » Sat Sep 11, 2021 6:29 am

4 hours ago, Scott Robison said:

Edit: Note that I started off with one byte tokens. The initial ambition was to create a simple scripting language. As things happen at times, as people saw it able to do more and more, a lot of "what if we added..." features crept in. Enough to overflow a byte. A more complex encoding would have been possible, but I was still a fairly young developer (I was about 25 when I did this project) and either didn't consider it and/or didn't have the time to rework the portions I had already written.

Yeah, it would help to be a little more dense on an 8 bit computer. Even 4 escape tokens would allow for 1024 commands, on top of 124 "prime" commands (ones with a single byte token), while keeping 32-127 as ASCII characters.
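To make the arithmetic concrete, here is a sketch of such an encoding. The counts (124 prime tokens, 4 escape bytes, 256 commands behind each escape) are from the paragraph above; the specific byte assignments (prime tokens at 0x80..0xFB, escapes at 0xFC..0xFF) and the function names are assumptions chosen for the example.

```python
PRIME_BASE = 0x80      # assumed: prime tokens occupy 0x80..0xFB
PRIME_COUNT = 124
ESCAPE_BASE = 0xFC     # assumed: escape bytes occupy 0xFC..0xFF
ESCAPE_COUNT = 4

def encode_command(n):
    """Map a zero-based command number to one or two token bytes."""
    if 0 <= n < PRIME_COUNT:
        return bytes([PRIME_BASE + n])
    ext = n - PRIME_COUNT
    if 0 <= ext < ESCAPE_COUNT * 256:
        return bytes([ESCAPE_BASE + ext // 256, ext % 256])
    raise ValueError("command number out of range")

def decode_command(data, pos):
    """Return (command number, next position) for the token at data[pos]."""
    first = data[pos]
    if first >= ESCAPE_BASE:
        number = PRIME_COUNT + (first - ESCAPE_BASE) * 256 + data[pos + 1]
        return number, pos + 2
    if first >= PRIME_BASE:
        return first - PRIME_BASE, pos + 1
    raise ValueError("not a command token")
```

That gives 124 + 4 * 256 = 1148 commands in total, with the most common ones still costing a single byte.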

paulscottrobson · Post by **paulscottrobson** » Sat Sep 11, 2021 9:48 am

Bank switching is workable. The 6502 BASIC is designed to be bank switched in 8k in any arrangement you fancy (or not at all) and there's only a slight speed hit, slightly worse on an original 6502 which doesn't have jmp (aaaa,x) - the modules are

the inline 65C02 assembler

error handling

device specific stuff I/O Files etc. Some stuff is standardised like text I/O, some is system specific like Sprites and Sound functionality.

system specific extensions - Sprites, Sounds, VPoke, that sort of stuff.

floating point (not actually written but the hooks and conversion stuff is all there)

Input/Output/Command Line/Program editing stuff

String handling & Management

Tokeniser/Detokeniser

Variable Management

Core (everything else)

Most of the time it lives in the core, which contains most of the actual program execution code and integer 32 bit arithmetic. For the other commonly called internal things (Variable Management, strings and FP) the overhead of the task switching is minimal by comparison, and the rest of it doesn't really affect the speed of the programmes. There is a small amount of code duplication where I decided it was worth duplicating functionality rather than switching back to another module, but very little.

Snickers11001001 · Post by **Snickers11001001** » Sat Sep 11, 2021 5:01 pm

7 hours ago, paulscottrobson said:

Bank switching is workable. The 6502 BASIC is designed to be bank switched in 8k in any arrangement you fancy (or not at all) and there's only a slight speed hit, slightly worse on an original 6502 which doesn't have jmp (aaaa,x) - the modules are

the inline 65C02 assembler

error handling

device specific stuff I/O Files etc. Some stuff is standardised like text I/O, some is system specific like Sprites and Sound functionality.

system specific extensions - Sprites, Sounds, VPoke, that sort of stuff.

floating point (not actually written but the hooks and conversion stuff is all there)

Input/Output/Command Line/Program editing stuff

String handling & Management

Tokeniser/Detokeniser

Variable Management

Core (everything else)

Most of the time it lives in the core, which contains most of the actual program execution code and integer 32 bit arithmetic. For the other commonly called internal things (Variable Management, strings and FP) the overhead of the task switching is minimal by comparison, and the rest of it doesn't really affect the speed of the programmes. There is a small amount of code duplication where I decided it was worth duplicating functionality rather than switching back to another module, but very little.

Wow, that's actually quite mature already. Is there a 'howto' to try it out on the X16 emulator? Is it a custom rom or something?

paulscottrobson · Post by **paulscottrobson** » Sun Sep 12, 2021 6:10 am

It should run as is, though it's running unbanked in low memory.