2 hours ago, TomXP411 said:
This is not far off from what I suggested a while back. Tokenize not just the keywords (PRINT, GOTO, etc) but also variable names and numeric literals. Assuming 01-12 are available as token codes, we could use:
01 - 8 bit byte (an integer literal between 0 and 255)
02 - 16 bit integer (any integer literal between -32768 and 65535)
03 - 40 bit float (any numeric literal with a decimal point, eg: 3.14 or 1.0)
04 - byte variable (# sigil or DIM x AS BYTE)
05 - integer variable (% sigil or DIM x AS INT)
06 - float variable (! sigil or DIM x AS FLOAT)
07 - string variable ($ sigil or DIM x AS STRING)
08 - label
09 - start of subroutine
10 - start of function
11 - end of function or subroutine
PRINT $1234 gets changed to
94 02 34 12
PRINT A$ might get converted to
94 07 01
and A$="HELLO" becomes
07 01 B2 "HELLO"
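As a rough sketch of the literal encoding described above (Python is used here only for illustration; the function names are made up, and the float case is left out because the post does not specify its byte layout):

```python
# Token codes come from the list above; the helper names are hypothetical.
TOK_BYTE, TOK_INT16 = 0x01, 0x02
TOK_PRINT = 0x94  # PRINT's token value as used in the examples above


def encode_int_literal(value: int) -> bytes:
    """Encode an integer literal as a type token followed by its raw bytes."""
    if 0 <= value <= 255:
        return bytes([TOK_BYTE, value])
    if -32768 <= value <= 65535:
        # Low byte first, matching the "02 34 12" byte order in the example.
        return bytes([TOK_INT16]) + (value & 0xFFFF).to_bytes(2, "little")
    raise ValueError("literal does not fit in 16 bits; it would need a float token")


def tokenize_print_int(value: int) -> bytes:
    """Tokenize `PRINT <integer literal>` into the PRINT token plus the literal."""
    return bytes([TOK_PRINT]) + encode_int_literal(value)


print(tokenize_print_int(0x1234).hex(" "))  # 94 02 34 12
print(tokenize_print_int(7).hex(" "))       # 94 01 07
```

The point is that the runtime never has to parse digits: the type token tells it how many bytes follow and how to read them.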
You could also change types on the fly by referencing a variable with a different sigil. So
X = $1234
could be referenced with the byte sigil and would act like a 2 byte array:
PRINT X#
34
PRINT X#(1)
12
(Remember that arrays are zero-based)
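The sigil trick amounts to reading the same heap bytes with a different element size. A minimal sketch, assuming a flat byte heap and low-byte-first storage (the heap size, the address of X, and the helper names are all invented for the example):

```python
heap = bytearray(16)  # hypothetical variable heap
X_ADDR = 0            # assumed heap address of X


def store_int(addr: int, value: int) -> None:
    """Store a 16-bit value, low byte first."""
    heap[addr:addr + 2] = (value & 0xFFFF).to_bytes(2, "little")


def load_byte(addr: int, index: int = 0) -> int:
    """Read the variable as a byte array: X#(index)."""
    return heap[addr + index]


def load_int(addr: int, index: int = 0) -> int:
    """Read the variable as a 16-bit integer array: X%(index)."""
    start = addr + 2 * index
    return int.from_bytes(heap[start:start + 2], "little")


store_int(X_ADDR, 0x1234)              # X = $1234
print(f"{load_byte(X_ADDR):02X}")      # PRINT X#    -> 34
print(f"{load_byte(X_ADDR, 1):02X}")   # PRINT X#(1) -> 12
```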
This implies that arrays are nothing special: array variables would simply reserve more than 1 space in the heap, so:
DIM NAMES$(25) AS STRING would reserve 50 bytes on the heap, and if you recalled NAMES%(x), you would get back a 2-byte value, which is actually the pointer to the string.
Where this comes in useful is creating large, arbitrary data arrays. For example, rooms in an adventure game.
DIM ROOM#(1024) creates a 1K chunk of memory that can be used for any purpose. You could then load rooms in on-demand from disk, every time the player moves from one room to another.
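If arrays are just bigger reservations, DIM reduces to a bump allocator. A sketch under assumptions: the element sizes follow the type list above (1 byte, 2 bytes, 5 bytes for a 40 bit float, 2 bytes for a string pointer), while the `Heap` class and its 4 KB size are hypothetical:

```python
ELEMENT_SIZE = {"#": 1, "%": 2, "!": 5, "$": 2}  # sigil -> bytes per element


class Heap:
    """Hypothetical bump allocator: DIM just reserves the next free bytes."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)
        self.next_free = 0

    def dim(self, sigil: str, count: int) -> int:
        """Reserve `count` elements of the sigil's type; return the start address."""
        nbytes = ELEMENT_SIZE[sigil] * count
        if self.next_free + nbytes > len(self.mem):
            raise MemoryError("out of heap space")
        addr = self.next_free
        self.next_free += nbytes
        return addr


heap = Heap(4096)
names = heap.dim("$", 25)    # DIM NAMES$(25): 25 two-byte pointers = 50 bytes
room = heap.dim("#", 1024)   # DIM ROOM#(1024): 1024 raw bytes
print(names, room, heap.next_free)  # 0 50 1074
```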
Labels and subroutine names would simply be more entries on the variable table.
The variable table itself is super simple:
01-02: data/code address
03: length of variable name
04-?? text of variable name
There are no types on the variable table, because the type is determined at runtime based on the token code. The token code is determined at compile time based on the sigil or a DEF <BYTE | INT | FLOAT> statement.
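To make the table layout concrete, here is a sketch that packs and walks entries in that format. The low-byte-first address, the ASCII names, and the sample addresses are assumptions, not something the post pins down:

```python
import struct


def pack_entry(address: int, name: str) -> bytes:
    """Build one entry: 2-byte address, 1-byte name length, then the name text."""
    raw = name.encode("ascii")
    if len(raw) > 255:
        raise ValueError("name too long for a one-byte length field")
    return struct.pack("<HB", address, len(raw)) + raw


def unpack_table(table: bytes) -> list:
    """Walk the table and return (address, name) pairs in order."""
    entries = []
    pos = 0
    while pos < len(table):
        address, length = struct.unpack_from("<HB", table, pos)
        pos += 3
        name = table[pos:pos + length].decode("ascii")
        pos += length
        entries.append((address, name))
    return entries


table = pack_entry(0x0800, "A") + pack_entry(0x0802, "NAMES")
print(table.hex(" "))       # 00 08 01 41 02 08 05 4e 41 4d 45 53
print(unpack_table(table))  # [(2048, 'A'), (2050, 'NAMES')]
```

Note there is no type field anywhere: the same entry can be reached through a byte, integer, or string token.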
There are a ton of advantages to this system. Right now, the BASIC routines all have to parse their own data. Doing it this way means the data is pre-parsed. The routines simply read the parameters directly out of the program stream.
The actual program text can be more compact, too. You don't store spaces. You don't store commas in parameter sequences. Those are just discarded and re-created if the program is detokenized (listed).
Yes, this is very similar to what I did for PCBoard, except I was not as concerned as maybe I should have been about token size. I created two byte tokens (it was for 16 bit DOS systems, after all). Every statement the language supported was a two byte signed positive number (no zero). Every function or operator that the expression engine supported was a two byte signed negative number. Variables were just an index into an array of dynamically typed values. Zero was used to mark the end of an expression where one was valid. So a PRINT statement that looked something like this:
PRINT a, b*2, c+3
Would be tokenized as:
$0001 $0003 $0001 $0000 $0002 $0003 $FFFF $0000 $0004 $0005 $FFFE $0000
Token meanings are:
$0001 - PRINT statement
$0003 - Count of expressions to evaluate for PRINT
$0001 - Index of variable a in vartab (pushed on eval stack)
$0000 - End of first expression (pop result from eval stack and print it)
$0002 - Index of variable b in vartab (pushed on eval stack)
$0003 - Index of constant 2 in vartab (pushed on eval stack; I stored constants as variables with impossible to generate names for runtime simplicity and stack evaluation consistency)
$FFFF - Multiplication operator (pop two values, multiply them, push result)
$0000 - End of second expression (pop result from eval stack and print it)
$0004 - Index of variable c in vartab (pushed on eval stack)
$0005 - Index of constant 3 in vartab (pushed on eval stack)
$FFFE - Addition operator (pop two values, add them, push result)
$0000 - End of third expression (pop result from eval stack and print it)
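For anyone who wants to see the stack evaluation run, here is a small Python sketch that executes the exact token stream above. It is not the original PCBoard code: the function names and the sample values in the vartab are invented, and the results are collected in a list rather than printed one by one.

```python
TOK_PRINT = 0x0001
OP_MUL = 0xFFFF    # -1 as a signed 16-bit value
OP_ADD = 0xFFFE    # -2 as a signed 16-bit value
END_EXPR = 0x0000


def eval_expr(tokens, pc, vartab):
    """Evaluate one postfix expression; return (result, pc after the end marker)."""
    stack = []
    while tokens[pc] != END_EXPR:
        tok = tokens[pc]
        pc += 1
        if tok == OP_MUL:
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs * rhs)
        elif tok == OP_ADD:
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        else:
            stack.append(vartab[tok])  # any other token is a vartab index
    return stack.pop(), pc + 1


def run_print(tokens, pc, vartab):
    """Run one PRINT statement; return (list of values, pc of next statement)."""
    if tokens[pc] != TOK_PRINT:
        raise ValueError("expected a PRINT token")
    count = tokens[pc + 1]
    pc += 2
    values = []
    for _ in range(count):
        value, pc = eval_expr(tokens, pc, vartab)
        values.append(value)
    return values, pc


program = [0x0001, 0x0003, 0x0001, 0x0000, 0x0002, 0x0003,
           0xFFFF, 0x0000, 0x0004, 0x0005, 0xFFFE, 0x0000]
# Sample vartab: a=10, b=4, constant 2, c=7, constant 3 (values are made up).
vartab = {1: 10, 2: 4, 3: 2, 4: 7, 5: 3}
values, next_pc = run_print(program, 0, vartab)
print(values, next_pc)  # [10, 8, 10] 12
```

Because constants live in the vartab like any other value, the evaluator only ever has two cases: an operator, or an index to push.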
The token values are not exact, but they illustrate the idea. My tokenized / pcode scripts could use an entire 64 KB for tokens, or 32,768 tokens max, plus extra memory for the vartab. Each token was stored at a given offset in the token array, so GOTO/JMP (whether used explicitly or as part of a structured construct) would just update the PC to the new token index.
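The jump handling is about as simple as it sounds. A sketch with made-up token numbers (the statement set, the `run` function, and the step limit are hypothetical and only illustrate the PC update):

```python
TOK_PRINT_VAR, TOK_GOTO, TOK_END = 1, 2, 3  # hypothetical statement tokens
MAX_TOKENS = 64 * 1024 // 2                 # 64 KB of two-byte tokens = 32768


def run(tokens, vartab, max_steps=100):
    """Execute tokens from index 0; return the list of printed values."""
    pc, output = 0, []
    for _ in range(max_steps):  # step limit guards against endless loops
        tok = tokens[pc]
        if tok == TOK_PRINT_VAR:
            output.append(vartab[tokens[pc + 1]])
            pc += 2
        elif tok == TOK_GOTO:
            pc = tokens[pc + 1]  # the operand is simply the target token index
        elif tok == TOK_END:
            break
        else:
            raise ValueError(f"unknown token {tok} at index {pc}")
    return output


program = [
    TOK_GOTO, 4,        # index 0: jump over the next statement
    TOK_PRINT_VAR, 0,   # index 2: never reached
    TOK_PRINT_VAR, 1,   # index 4: jump target
    TOK_END,            # index 6
]
print(run(program, ["skipped", "reached"]))  # ['reached']
```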
Obviously, on an 8 bit system, you'd want more compact tokens. And for an interactive interpreter, you'd have to store them more like you suggested so that as lines were added, edited, and deleted, the program would still be runnable. My language had a script compiler that translated from a text file to the tokenized form, which meant I didn't have to worry about LISTing it back out later (though some enterprising third party did reverse engineer it and created a decompiler for it which was able to do a great job, as I did not attempt to write an optimizing compiler). I remember the README file that came with it being critical of how simplistic it was and how it should have been done in a different way to generate machine code, but that wasn't the point. We wanted to let people customize the system. And later when we ported PCBoard to OS/2, most scripts continued to run unmodified because we didn't tie them to a particular machine representation. The only exceptions were if people used the PEEK & POKE and similar statements because ... I was able to do it, so why not?
https://www.99-bottles-of-beer.net/language-ppl-562.html gives a longer if not completely useful example.
Edit: Note that I started off with one byte tokens. The initial ambition was to create a simple scripting language. As sometimes happens, once people saw it could do more and more, a lot of "what if we added..." features crept in. Enough to overflow a byte. A more complex encoding would have been possible, but I was still a fairly young developer (I was about 25 when I did this project) and either didn't consider it or didn't have the time to rework the portions I had already written.
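For what it's worth, one common shape for such a "more complex encoding" is an escape byte: frequent tokens stay one byte, and a reserved value announces a two-byte token. This is purely an illustration of the general idea, not what PCBoard did; the escape value, the unsigned token ids, and the function names are all assumptions:

```python
ESCAPE = 0xFF  # hypothetical reserved byte meaning "a 16-bit token id follows"


def encode_tokens(ids):
    """Encode token ids: one byte when below ESCAPE, else ESCAPE + 2 bytes."""
    out = bytearray()
    for token_id in ids:
        if 0 <= token_id < ESCAPE:
            out.append(token_id)
        elif ESCAPE <= token_id <= 0xFFFF:
            out.append(ESCAPE)
            out += token_id.to_bytes(2, "little")
        else:
            raise ValueError(f"token id out of range: {token_id}")
    return bytes(out)


def decode_tokens(data):
    """Inverse of encode_tokens."""
    ids, pos = [], 0
    while pos < len(data):
        if data[pos] == ESCAPE:
            ids.append(int.from_bytes(data[pos + 1:pos + 3], "little"))
            pos += 3
        else:
            ids.append(data[pos])
            pos += 1
    return ids


encoded = encode_tokens([1, 200, 300])
print(encoded.hex(" "))        # 01 c8 ff 2c 01
print(decode_tokens(encoded))  # [1, 200, 300]
```

Three tokens take five bytes here instead of six with fixed two-byte tokens; the saving depends entirely on how often the rare tokens show up.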