Custom BASIC interpreter for X16

Chat about anything CX16 related that doesn't fit elsewhere
VincentF
Posts: 75
Joined: Mon Jun 29, 2020 8:22 pm

Custom BASIC interpreter for X16

Post by VincentF »


Hello Everybody,

A few days ago I started thinking about making a custom BASIC interpreter for the X16, one that would take advantage of the entire memory map and the capabilities of the machine.

It's still "on paper"; right now I have a very small prototype that can read a number from the prompt, and that's all. I started writing the code into banked RAM, but I ran into issues because I also want to use the other banks, so I need to find out how to get my program into one of the banked ROM slots instead.

 

To explain how I see this project: I'd like to use the RAM banks to store the variables (and why not the BASIC programs too). Also, considering the speed of the X16, I'd like to see if we can tokenize the lines instead of storing plain text. This would remove the need to squeeze every byte out of each BASIC line, which makes programs unreadable, though it needs more processing to tokenize each line and "untokenize" it when listing. Running pre-tokenized lines should make the BASIC runtime faster.

For variable names, I'd see a "VARTAB" section that contains the two characters of the variable name followed by its address. Lines of BASIC that use the variable would hold a pointer to a VARTAB entry. In case the variable moves (as happens with strings), only the VARTAB entry changes its address.

Constant values will be tokenized as well, so they'll be stored in variable space (or in a dedicated constant space) without any entries in VARTAB.
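A tiny Python model of the VARTAB idea might look like this. The structure and function names are my own illustration, not an actual implementation: each entry pairs the two-character name with the variable's current address, tokenized lines would store only the entry index, and a string reallocation touches nothing but the entry.

```python
# Illustrative model of the proposed VARTAB scheme (all names hypothetical).

VARTAB = []  # each entry: [two-char name, current data address]

def vartab_add(name, addr):
    """Register a variable; return the entry index that tokenized lines store."""
    VARTAB.append([name[:2], addr])
    return len(VARTAB) - 1

def vartab_move(index, new_addr):
    """A string was reallocated: only the VARTAB entry changes.
    Every tokenized line still points at the same entry index."""
    VARTAB[index][1] = new_addr

a = vartab_add("A$", 0x9F00)
vartab_move(a, 0xA000)  # lines referencing entry `a` are untouched
```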

I imagine variable types like INT16, INT32, FLOAT16, STRING, ARRAY, etc., and even a special type that represents a segment of free space.

 

I think I'm getting a bit messy in my explanations, so I'll stop for now. Nothing's fixed; I'm just trying to gather ideas and feedback, and maybe some of you would be interested in participating?

@rje If I remember correctly, you already worked on a tokenizer? Do you think this project is viable?

I'm doing this just to learn how it works, so if all of this is a bad idea, at least I had fun imagining the thing and building the prototype.

Kalvan
Posts: 115
Joined: Mon Feb 01, 2021 10:05 pm

Custom BASIC interpreter for X16

Post by Kalvan »


In addition to these, I'd like to add a few more items to the necessary specification:

 

Reserved words to handle sprite frames and tiles, and syntax (and possible garbage collection) to handle (re)loading of CLUTs without having to resort to a whole bunch of VPEEKs and VPOKEs.

Reserved words to handle access to the PSG and Yamaha YM2151's sound channels and dynamic file linking to handle direct access to the YM3012 DAC and the PCM channel.

paulscottrobson
Posts: 300
Joined: Tue Sep 22, 2020 6:43 pm

Custom BASIC interpreter for X16

Post by paulscottrobson »

VincentF
Posts: 75
Joined: Mon Jun 29, 2020 8:22 pm

Custom BASIC interpreter for X16

Post by VincentF »



6 minutes ago, paulscottrobson said:




Okay, I wasn't aware that such things already exist.

Guess I can forget my idea and take a look at this one instead. Thanks for sharing!

Scott Robison
Posts: 952
Joined: Fri Mar 19, 2021 9:06 pm

Custom BASIC interpreter for X16

Post by Scott Robison »


No need to forget your idea. Speaking as one who has released a language commercially through a previous employer, you will understand a language so much more by actually implementing one yourself!

I've got ideas for an environment I'd like to create ... just need more time to be able to focus on it.

rje
Posts: 1263
Joined: Mon Apr 27, 2020 10:00 pm
Location: Dallas Area

Custom BASIC interpreter for X16

Post by rje »


I think that, specifically, tokenizing from an editor, line by line, is plenty fast on the X16.

Tokenizing an entire file all at once would probably be a bit laggy; that's the Python model. But it might be OK? You'd have to try it and see.

Banks ... just spitballing here ... banks could serially hold tokenized programs.  The "heap" would (probably?) have to be in main RAM, i.e. common to all the banks.  But the banks are a serial run of "memory hunks", so when you hit a bank boundary, you just skip lightly into the next bank and keep going.  That seems OK to me.
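The "skip lightly into the next bank" idea can be sketched like this. The bank size matches the X16's 8 KB banked-RAM window; everything else here is a toy model, not real interpreter code.

```python
# Toy model of reading a tokenized program stored serially across RAM banks.

BANK_SIZE = 8192  # the X16 banked-RAM window is 8 KB ($A000-$BFFF)

# Fake bank contents: bank n is filled with the byte value n.
banks = [bytes([n]) * BANK_SIZE for n in range(4)]

def read_byte(bank, offset):
    """Fetch one byte, stepping into the next bank at a boundary."""
    if offset >= BANK_SIZE:
        bank += offset // BANK_SIZE   # "skip lightly into the next bank"
        offset %= BANK_SIZE
    return bank, offset + 1, banks[bank][offset]

bank, offset = 0, BANK_SIZE - 1
bank, offset, b1 = read_byte(bank, offset)   # last byte of bank 0
bank, offset, b2 = read_byte(bank, offset)   # first byte of bank 1
```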

* *

And for BASIC it's probably OK... although... I suspect programs need more data space than program space (I mean, a 40K BASIC program is a monster.  I've been there.). So banks really are ideal for data.

Now if that data just happens to be program fragments, that's okay too.  But ... banks are best for data.  Not that you can't make it work -- and it might work really well -- but I'm just thinking this. 

 

Scott Robison
Posts: 952
Joined: Fri Mar 19, 2021 9:06 pm

Custom BASIC interpreter for X16

Post by Scott Robison »


I think that tokenizing an entire program at run time, where "tokenizing" simply means replacing multi-character keywords with one-byte tokens, probably wouldn't be *too* bad, but the longer the program, the more obvious it would be. That's what makes the traditional BASIC feedback loop so powerful. The language is dog slow, but it quickly gives you feedback. You don't notice how much time is being spent tokenizing individual lines as you go, but it would become obvious for a long enough program. Plus, the whole program might not fit in memory if it wasn't tokenized a line at a time (for a sufficiently long program, anyway).
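A line-at-a-time keyword tokenizer of this kind can be sketched in a few lines of Python. The token values are borrowed from the classic CBM BASIC table, and the quote handling mirrors how CBM BASIC leaves string literals untokenized; treat it as an illustration, not an implementation.

```python
# Minimal sketch of line-at-a-time keyword tokenizing.
# Token values are the classic CBM BASIC ones (PRINT = $99, etc.).
TOKENS = {"PRINT": 0x99, "GOTO": 0x89, "GOSUB": 0x8D, "FOR": 0x81}

def tokenize_line(line):
    out = bytearray()
    i = 0
    in_quotes = False
    while i < len(line):
        ch = line[i]
        if ch == '"':
            in_quotes = not in_quotes   # never tokenize inside string literals
        if not in_quotes and ch != '"':
            matched = False
            for word, tok in TOKENS.items():
                if line.startswith(word, i):
                    out.append(tok)      # keyword collapses to one byte
                    i += len(word)
                    matched = True
                    break
            if matched:
                continue
        out.append(ord(ch))              # everything else passes through
        i += 1
    return bytes(out)
```

For example, `tokenize_line('PRINT "PRINT"')` tokenizes only the first PRINT; the quoted one is stored verbatim.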

Python does far more than tokenize in its startup process. It is actually compiling, not just interpreting, the code, though it does compile to pcode instead of machine language. So Python has the benefits of compiling and the drawbacks of interpreting, and the drawbacks of compiling and the benefits of interpreting. It gives you a faster feedback loop than say C++, but still takes more time to compile at the beginning. But then, unlike BASIC, it doesn't have to re-parse every line every time. It is able to detect syntax errors and report them early rather than having to wait until it tries to execute that line.

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

Custom BASIC interpreter for X16

Post by Snickers11001001 »


Love, love love these projects. 

If I can posit any advice from the cheap seats (i.e., from someone like me who absolutely has no ability to make such a thing, but who is a user of BASIC and enjoys it), it would be this: there are things that BASIC needs, but not everything from 'gen 3' BASICs like Visual Basic needs to be included or is feasible. Remember, at bottom this platform is still an 8-bit machine with no prefetch, no branch prediction, no modern pipelining, no cache memory, almost no maths, and a 16-bit address space. It's running at 8MHz, maybe, unless (a) they can't get it to work at that speed, in which case it will be a 4MHz machine, or (b) they release the faster X8 FPGA version (which won't have the banked memory, ZP space, or the space in the pages below $0800 that the X16 has) and which probably won't be able to fit all your features anyway.

Just take a look at the discussion near the end of the "BASIC 2 vs BASIC 7" thread about the impact of just a few more instructions in the byte-fetching/parsing core between the C64 and the later Plus/4 and C128, in terms of the negative performance impact on the later machines with better BASICs. If you're having to bank-switch, for example, surely it takes a hit.

Tokenizing a lot of stuff inline (e.g., constants, jumps, variable memory locations) is a great idea, but I suggest a simple escape-code structure using byte codes. The parser finds, say, the PETSCII code for '@' not inside quotes, and it knows the next two bytes are a small-int in 16-bit signed format; it finds the PETSCII code for [english pound] (which looks a bit like a mutant F), it knows the next 5 bytes are the exponent and mantissa for a float; it finds the token for 'GOTO' or 'GOSUB', it knows the next two bytes are the actual 16-bit address of the jump destination in the code, instead of the PETSCII numeric representation of a line number; it finds the PETSCII code for '%', it knows the next two bytes are the 16-bit address of the value, followed by the name of an int-style variable in memory. (At execution it just needs to fetch the value; during a LIST operation it grabs the variable name at that address+2 until it hits the terminator.) Yeah, OK, the modern way to do many of these things would be with a hash table, but I caution you to consider the performance impact on an 8-bit machine.
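The escape-byte idea might decode like this in a Python sketch. The marker values below are stand-ins I picked for illustration, not the actual PETSCII codes; the point is that a marker byte tells the interpreter exactly how to read what follows, with no re-parsing of ASCII digits.

```python
import struct

# Hypothetical marker bytes standing in for the PETSCII codes mentioned.
ESC_INT16 = 0x40   # stand-in for '@': next two bytes are a signed 16-bit int
ESC_ADDR  = 0x25   # stand-in for '%': next two bytes are a variable's address

def read_operand(code, offset):
    """Decode one escaped operand; return (value, offset of next byte)."""
    marker = code[offset]
    if marker == ESC_INT16:
        (value,) = struct.unpack_from("<h", code, offset + 1)  # little-endian signed
        return value, offset + 3
    if marker == ESC_ADDR:
        (addr,) = struct.unpack_from("<H", code, offset + 1)   # unsigned address
        return addr, offset + 3
    raise ValueError("not an escape byte")

code = bytes([ESC_INT16]) + struct.pack("<h", -123)
value, nxt = read_operand(code, 0)
```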

If you use the idea of inlining 16-bit addresses for jump locations to speed up execution, of course, then there are other issues. With line numbers, your LIST routine needs only follow the 16-bit address, grab the line number at that address, and put it on the screen during a LIST; but with labels, you will need to set up a data structure (probably a linked list) that the interpreter can consult during LIST operations to regurgitate the labels or line numbers when the user lists the code, and that metadata has to get saved with the program. That's actually a better place to use banked memory... the performance cost of swapping banks is not as important when listing the code. I don't think it's feasible to tokenize at runtime; it needs to happen as you enter things.

Scott Robison
Posts: 952
Joined: Fri Mar 19, 2021 9:06 pm

Custom BASIC interpreter for X16

Post by Scott Robison »



30 minutes ago, Snickers11001001 said:




If you use the idea of inlining 16-bit addresses for jump locations to speed up execution, of course, then there are other issues. With line numbers, your LIST routine needs only follow the 16-bit address, grab the line number at that address, and put it on the screen during a LIST; but with labels, you will need to set up a data structure (probably a linked list) that the interpreter can consult during LIST operations to regurgitate the labels or line numbers when the user lists the code, and that metadata has to get saved with the program. That's actually a better place to use banked memory... the performance cost of swapping banks is not as important when listing the code. I don't think it's feasible to tokenize at runtime; it needs to happen as you enter things.



I like what you wrote, and I wrote much the same myself months ago in another thread. I think an improved tokenizer that tokenized the entire line would help BASIC runtime a lot. But even if you have a tokenizer that knows to create two tokens for "GOTO 1234" (one for GOTO, one for the 16-bit value), all you've got is the next line number. You still have to find where that line is in memory in the linked list of BASIC lines.
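That search through the linked list of lines can be sketched as follows. This is a toy model: the real structure would be next-line pointers in memory rather than a Python list, but the linear scan is the same.

```python
# Toy model: GOTO still has to walk the program's line list to find its target,
# even when the target line number is pre-tokenized as a 16-bit value.

def find_line(lines, target):
    """lines: list of (line_number, code) in program order, standing in for
    the linked list of BASIC lines. Returns the index of the target, or None."""
    for i, (num, _) in enumerate(lines):
        if num == target:
            return i
    return None

program = [(10, "PRINT"), (20, "GOTO 40"), (40, "END")]
```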

But in general I am in complete agreement. A better tokenization scheme could go a long way to improving BASIC performance.

Edit: Sorry, that doesn't read well. I didn't mean "I had a great idea and good for you"; I meant "you have a great idea and I agree". It came off more self-congratulatory than I intended.

TomXP411
Posts: 1760
Joined: Tue May 19, 2020 8:49 pm

Custom BASIC interpreter for X16

Post by TomXP411 »


 


2 hours ago, Snickers11001001 said:




Love, love love these projects. 



If I can posit any advice from the cheap seats (i.e., from someone like me who absolutely has no ability to make such a thing, but who is a user of BASIC and enjoys it), it would be this: there are things that BASIC needs, but not everything from 'gen 3' BASICs like Visual Basic needs to be included or is feasible. Remember, at bottom this platform is still an 8-bit machine with no prefetch, no branch prediction, no modern pipelining, no cache memory, almost no maths, and a 16-bit address space. It's running at 8MHz, maybe, unless (a) they can't get it to work at that speed, in which case it will be a 4MHz machine, or (b) they release the faster X8 FPGA version (which won't have the banked memory, ZP space, or the space in the pages below $0800 that the X16 has) and which probably won't be able to fit all your features anyway.



Just take a look at the discussion near the end of the "BASIC 2 vs BASIC 7" thread about the impact of just a few more instructions in the byte-fetching/parsing core between the C64 and the later Plus/4 and C128, in terms of the negative performance impact on the later machines with better BASICs. If you're having to bank-switch, for example, surely it takes a hit.



Tokenizing a lot of stuff inline (e.g., constants, jumps, variable memory locations) is a great idea, but I suggest a simple escape-code structure using byte codes. The parser finds, say, the PETSCII code for '@' not inside quotes, and it knows the next two bytes are a small-int in 16-bit signed format; it finds the PETSCII code for [english pound] (which looks a bit like a mutant F), it knows the next 5 bytes are the exponent and mantissa for a float; it finds the token for 'GOTO' or 'GOSUB', it knows the next two bytes are the actual 16-bit address of the jump destination in the code, instead of the PETSCII numeric representation of a line number; it finds the PETSCII code for '%', it knows the next two bytes are the 16-bit address of the value, followed by the name of an int-style variable in memory. (At execution it just needs to fetch the value; during a LIST operation it grabs the variable name at that address+2 until it hits the terminator.) Yeah, OK, the modern way to do many of these things would be with a hash table, but I caution you to consider the performance impact on an 8-bit machine.



If you use the idea of inlining 16-bit addresses for jump locations to speed up execution, of course, then there are other issues. With line numbers, your LIST routine needs only follow the 16-bit address, grab the line number at that address, and put it on the screen during a LIST; but with labels, you will need to set up a data structure (probably a linked list) that the interpreter can consult during LIST operations to regurgitate the labels or line numbers when the user lists the code, and that metadata has to get saved with the program. That's actually a better place to use banked memory... the performance cost of swapping banks is not as important when listing the code. I don't think it's feasible to tokenize at runtime; it needs to happen as you enter things.



This is not far off from what I suggested a while back. Tokenize not just the keywords (PRINT, GOTO, etc.) but also variable names and numeric literals. Assuming 01-11 are available as token codes, we could use:

01 - 8 bit byte (an integer literal between 0 and 255)

02 - 16 bit integer  (any integer literal between -32768 and 65535)

03 - 40 bit float  (any numeric literal with a decimal point, eg: 3.14 or 1.0)

04 - byte variable (# sigil or DIM x AS BYTE)

05 - integer variable (% sigil or DIM x AS INT)

06 - float variable (! sigil or DIM x AS FLOAT)

07 - string variable ($ sigil or DIM x AS STRING)

08 - label

09 - start of subroutine 

10 - start of function

11 - end of function or subroutine

PRINT 1234 gets changed to 

94 02 34 12

PRINT A$ might get converted to

94 07 01

and A$="HELLO" becomes

07 01 B2 "HELLO"
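A toy encoder for the first two examples might look like this. The token values (0x94 for PRINT, the 01-11 type codes) are taken from the post above, and the first example's literal is read as $1234 so the bytes match; everything else is an illustrative assumption.

```python
import struct

# Token values taken from the scheme above; layout is my own illustration.
T_PRINT  = 0x94
T_INT16  = 0x02   # type code 02: 16-bit integer literal
T_STRVAR = 0x07   # type code 07: string variable

def encode_print_int(n):
    """PRINT <16-bit literal> -> PRINT token, int-literal code, little-endian value."""
    return bytes([T_PRINT, T_INT16]) + struct.pack("<H", n)

def encode_print_strvar(var_index):
    """PRINT A$ -> PRINT token, string-variable code, variable-table index."""
    return bytes([T_PRINT, T_STRVAR, var_index])

encode_print_int(0x1234)   # -> 94 02 34 12, as in the example above
encode_print_strvar(1)     # -> 94 07 01
```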

You could also change types on the fly by referencing a variable with a different sigil. So 

X = $1234

could be referenced with the byte sigil and would act like a 2 byte array:

PRINT X#  

34

PRINT X#(1) 

12

(Remember that arrays are zero-based)

This implies that arrays are nothing special: array variables would simply reserve more than 1 space in the heap, so:

DIM NAMES$(25) AS STRING would reserve 50 bytes on the heap, and if you recalled NAMES%(x), you would get back a 2-byte value, which is actually the pointer to the string. 

Where this comes in useful is creating large, arbitrary data arrays. For example, rooms in an adventure game. 

DIM ROOM#(1024) creates a 1K chunk of memory that can be used for any purpose. You could then load rooms in on-demand from disk, every time the player moves from one room to another. 

Labels and subroutine names would simply be more entries on the variable table.

The variable table itself is super simple:

01-02: data/code address

03: length of variable name

04-?? text of variable name

There are no types on the variable table, because the type is determined at runtime based on the token code. The token code is determined at compile time based on the sigil or a DEF <BYTE | INT | FLOAT> statement.
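That entry layout packs and unpacks like this, as a sketch with the field order exactly as listed: two address bytes, a length byte, then the name text.

```python
import struct

def pack_entry(addr, name):
    """Variable-table entry: 16-bit address, name length, name text."""
    return struct.pack("<H", addr) + bytes([len(name)]) + name.encode("ascii")

def unpack_entry(buf, offset=0):
    """Return (address, name, offset of the next entry)."""
    addr, length = struct.unpack_from("<HB", buf, offset)
    name = buf[offset + 3 : offset + 3 + length].decode("ascii")
    return addr, name, offset + 3 + length

entry = pack_entry(0xA000, "ROOM")
addr, name, nxt = unpack_entry(entry)
```

Because the length byte precedes the name, entries can be walked sequentially without any per-entry type field, just as described.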

There are a ton of advantages to this system. Right now, the BASIC routines all have to parse their own data. Doing it this way means the data is pre-parsed. The routines simply read the parameters directly out of the program stream.

The actual program text can be more compact, too. You don't store spaces. You don't store commas in parameter sequences. Those are just discarded and re-created when the program is detokenized (listed).



 
