INC BASIC (for lack of a better name)

Scott Robison · Post by **Scott Robison** » Fri Apr 02, 2021 8:05 am

Preface: I'm not trying to teach anyone anything, I'm just trying to commit some thoughts to writing and soliciting feedback.

On the one hand we have Commodore / Microsoft BASIC. It tokenizes keywords into one or two byte tokens, but otherwise the line is stored in memory as typed. Execution is slow, but it is easy to list for the programmer to read, and the interpreter reparses the line at run time for every execution.

At the other extreme we have machine language which is very hard to read. Assembly makes it a little easier, but it is still difficult to read.

In between we have various languages that can be compiled to some form of code, either machine code or p-code, but there is a "painful" compile / run / test cycle.

I think a more elaborate BASIC interpreter could improve on the execution speed of traditional BASIC by doing more work while crunching, trading a little less efficiency as the programmer is writing the code for improved efficiency after typing run. This BASIC would still allow a highly interactive experience for the programmer, without a time consuming edit / compile / link / run cycle.

As an example, consider a line of BASIC:

10 PRINT 200+300*500+700

BASIC crunches it down to the following bytes (in hex):

17 08 (pointer to next line)

0A 00 (line number 10)

99 20 (PRINT token followed by a space)

32 30 30 (digits 2 0 0)

AA (add token)

33 30 30 (digits 3 0 0)

AC (multiply token)

35 30 30 (digits 5 0 0)

AA (add token)

37 30 30 (digits 7 0 0)

00 (end of line)

To execute that after typing run, BASIC has to read 22 bytes. First it notes that the keyword is PRINT. Skip the whitespace. Convert the digits 200 to a floating point number. Push it on the stack. Read the add operator which means there has to be another expression to the right. Convert the digits 300 to a floating point number. Push it on the stack. Read the multiply operator which means there is another expression to the right, and multiply has higher precedence than add. Convert the digits 500 to a floating point number. Read the add operator which means there is another expression to the right, and add has lower precedence than multiply, so finish the multiply by popping 300 and 500 from the stack, multiplying them, and pushing the result (150000) back on the stack. Convert the digits 700 to a floating point number. Push it on the stack. Read the end of line marker. Pop 150000 and 700, add them, and push the result (150700) back on the stack. Pop 200 and 150700, add them, and push the result (150900) back on the stack. Now we have a single expression, so print it.

Imagine an alternative implementation that crunches the line to bytes as follows:

xx yy (pointer to next line; the exact value doesn't matter at the moment)

0A 00 (line number 10)

10 (length of "listable" crunched line in bytes [16])

99 20 (PRINT token followed by a space)

01 C8 (literal byte value 200)

AA (add token)

02 2C 01 (literal word value 300)

AC (multiply token)

02 F4 01 (literal word value 500)

AA (add token)

02 BC 02 (literal word value 700)

08 (length of "executable" crunched line in bytes)

02 (offset of literal byte value 200)

05 (offset of literal word value 300)

09 (offset of literal word value 500)

AC (multiply)

AA (add)

0D (offset of literal word value 700)

AA (add)

99 (PRINT)

Listing the code becomes more complex (slower) because there is more "uncrunching" to do. Entering a line becomes more complex (slower) because there is more "crunching" to do. Running that line of code has to read 25 bytes instead of 22 bytes, but it doesn't have to convert strings to numbers which results in much less machine code being executed. In my imaginary code above I'm using bytes and words to store the literal numbers, but we could store them in another larger but still preprocessed format (such as floating point) that is much faster for the interpreter to process at run time, rather than continually converting text to numbers.

Of course, this benefit can be achieved in large part by storing numeric constants in variables which only have to be converted once, and people do that already in BASIC when they are trying to optimize their code.

I'm not suggesting this is the exact alternative format that should be used for INC BASIC. Some of my thoughts include:

1. A full screen editor to edit blocks of code by name, rather than requiring line numbers.

2. Crunching a line would identify all the "tokens" in the line of text and store them in a table that includes variables, constants, labels. In this way variable creation would be part of editing the code, rather than a step that takes place at run time as variables are encountered for the first time.

3. Constant expressions could be evaluated at crunch time.

4. Labels are basically just constant expressions, so there would not need to be any slow linked list search of where to goto or gosub next.

5. Inline assembly for small speed critical parts would be nice to have.

6. Support more than just floating point expressions, since byte math is faster than word math is faster than floating point math.

In essence, the full screen editor would "compile" the text into a tokenized form, updating data structures as it went so that when it came time to run the program, all it had to do is reset variables to zero / default values.

I welcome feedback. If you think it is the worst idea in the history of ideas, that's fine, I'm just thinking it could be a nice middle ground between existing BASIC and the machine language monitor, especially if it could be located in a ROM bank. The way to make code faster is to execute less of it, and I think something like this is at least an interesting thought experiment.

BruceMcF · Post by **BruceMcF** » Fri Apr 02, 2021 4:51 pm

Following on a previous comment in a General Chat post thread, if the text is stored in a file, then uncrunching is not an issue. That is, part of the point of the early Microsoft Basic approach was to have a line-editable program which could be run at any time a line was done being edited, with the whole program a thing that could be saved and then reloaded as a block with very rudimentary mass storage. So if the system baseline includes a hierarchical file system with sufficiently rapid access that the runtime no longer has to be the same as the entire editable program, the runtime interpreted code can generated in memory line by line, but the source kept in a parallel file which includes the entered lines as well as reference information.

It could even still be line-editable, to work headless context over the serial port. It's just that the line editing would be, (1) modify existing / write new line, (2) crunch interpretable runtime (3) add back reference data to the text of the line.

To still be Basic, no variables or labels have to be declared, so there is also a/some linked list(s) or binary tree(s) of unresolved variable references, and running when the list(s) is/are not empty generates a syntax error.

$18 length of listable line in bytes [24]

"10 PRINT 200+300*500+700"

$05 length of stored descriptors

xx yy (pointer to the runnable line in memory)

80 (label token)

00 02 (offset from start of listable line, length of label)

[NB. Suggests end of file is an empty listable line with no stored descriptors, $00 $00]

Crunched line:

99 (PRINT token, formatting not needed in runnable code)

01 C8 (literal byte value 200)

AA (add token)

02 2C 01 (literal word value 300)

AC (multiply token)

02 F4 01 (literal word value 500)

AA (add token)

FF (execute line token)

Literal tokens write the value into the operand stack, type, exponent if floating, 32bytes integer/mantissa, operand tokens push their executable address onto the operand, on encountering the line-end token, the operand stack is executed.

[Copy and paste all the disclaimers in the original post at this point.]

rje · Post by **rje** » Fri Apr 02, 2021 5:15 pm

9 hours ago, Scott Robison said:

1. A full screen editor to edit blocks of code by name, rather than requiring line numbers.

2. ...a [symbol] table that includes variables, constants, labels. In this way variable creation would be part of editing the code...

3. Constant expressions could be evaluated at crunch time.

4. Labels are basically just constant expressions...

5. Inline assembly for small speed critical parts would be nice to have.

6. Support more than just floating point expressions...

(#7) In essence, the full screen editor would "compile" the text ...

(#8) ...I'm just thinking it could be a nice middle ground between existing BASIC and the machine language monitor, especially if it could be located in a ROM bank. The way to make code faster is to execute less of it, and I think something like this is at least an interesting thought experiment.

#1 perhaps inadvertently made me think of Smalltalk's editor, where the image file is presented only as sets of named methods, rather than large hunks of files and their contents. That appeals to me. But it might not be what you were talking about.

#2 #3 It sounds like tokenizing a line into a p-code (if that's incorrect then OK), which is the way I like it.

#6 I've become convinced by you guys that floats aren't necessary. But, I do think longs are sometimes needed, even though they're slow.

Generally I agree to all of your points.

Scott Robison · Post by **Scott Robison** » Fri Apr 02, 2021 5:39 pm

21 minutes ago, BruceMcF said:

Following on a previous comment in a General Chat post thread, if the text is stored in a file, then uncrunching is not an issue. That is, part of the point of the early Microsoft Basic approach was to have a line-editable program which could be run at any time a line was done being edited, with the whole program a thing that could be saved and then reloaded as a block with very rudimentary mass storage. So if the system baseline includes a hierarchical file system with sufficiently rapid access that the runtime no longer has to be the same as the entire editable program, the runtime interpreted code can generated in memory line by line, but the source kept in a parallel file which includes the entered lines as well as reference information.

It could even still be line-editable, to work headless context over the serial port. It's just that the line editing would be, (1) modify existing / write new line, (2) crunch interpretable runtime (3) add back reference data to the text of the line.

Perhaps this is the best way to do it. My thought of having the crunching routine do more work to create a more optimized executable form of the program was the spread out the work of "compiling" to one or a few lines at a time. If someone attempted to write a 20 KB program (lets say average line length of 40 characters, so 500 lines of BASIC) the time to prepare to run would be more obvious than if it was spread out over 500 individual presses of the enter key. Using a completely made up number, if it takes 10 ms per line to crunch the program into its executable form, that would require 5 seconds from the time one typed RUN until it was actually running. 10 ms is probably far longer than it would actually take ... just thinking out loud.

But we're dealing with a machine that won't have the exact same constraints as a C64. More ROM space. More banked RAM. Faster storage and faster CPU clock. Maybe all those add up to a model where the complexity of this concept isn't as valuable as it would have been 30 to 40 years ago.

44 minutes ago, BruceMcF said:

To still be Basic, no variables or labels have to be declared, so there is also a/some linked list(s) or binary tree(s) of unresolved variable references, and running when the list(s) is/are not empty generates a syntax error.

Agreed. My thought was to have every reference of a variable add it to a table if it isn't already present, so that crunching ensures the variable exists. Ditto with labels. If you have GOTO label, where the label hasn't yet been defined, you at least get a placeholder for the eventual label value. Each variable might include a ref count so that when a line with a ref is deleted, and the ref count hits zero, the space can be reclaimed.

When I designed a scripting language for PCBoard, my initial plan had been: 1, edit a script text file; 2, load the script which would compile it to tokenized form at load time; 3, execute the transient token stream. This was running on 16 bit DOS and the load / compile / execute time was slow enough that I switched to a pre-compiled form, but we didn't attempt to have a script editor built into the BBS. My thoughts are basically how to merge the two ideas so that "compile" was done gradually over the period of the program being written.

43 minutes ago, BruceMcF said:

Literal tokens write the value into the operand stack, type, exponent if floating, 32bytes integer/mantissa, operand tokens push their executable address onto the operand, on encountering the line-end token, the operand stack is executed.

That's essentially what I had in mind. Tokens would either be constants, variables, or executable (operators, functions, statements). I was thinking it would be stack based so that the statement actually is the last token so multiple statements are easily "crunched" into one sequence of tokens without a need to have extra delimiters / markers.

Scott Robison · Post by **Scott Robison** » Fri Apr 02, 2021 5:45 pm

25 minutes ago, rje said:

#1 perhaps inadvertently made me think of Smalltalk's editor, where the image file is presented only as sets of named methods, rather than large hunks of files and their contents. That appeals to me. But it might not be what you were talking about.

#2 #3 It sounds like tokenizing a line into a p-code (if that's incorrect then OK), which is the way I like it.

#6 I've become convinced by you guys that floats aren't necessary. But, I do think longs are sometimes needed, even though they're slow.

Generally I agree to all of your points.

You've got it generally how I see it. I've never used smalltalk, but my thought was the LIST command without args would actually give you a dictionary of the fragments / functions / subroutines / whatever. LIST name would list the named fragment. EDIT name would pull up a full screen that would allow the named fragment to be edited, and when exiting it would do the bookkeeping to generate the p-code from the listable text. Floats usually aren't necessary, except for when they are. They have their place, but it shouldn't be the default data type if possible.

rje · Post by **rje** » Fri Apr 02, 2021 6:13 pm

Well heck isn't that how we're supposed to write our C code? I do a list to see my files, which as a responsible and experienced programmer of course naturally only exactly represent one concept ever. Then I invoke the trusty vi editor to actually edit the file I was looking for.

The bonus step would be a thing at the termination of vi which recompiles the p-code.

Almost similar to the way Python p-codifies a source ONLY IF there's not a fresh precompiled file already there.

=> which is another option, probably worse because now we'd have to keep track of another file at run-time.

The Smalltalk experience may vary from version to version, but the classic view is a "system browser" (e.g. dictionary list of packages, classes, methods). Select one and you get the code for that method, period. Easy to switch from one method to the next, but you never see the whole shebang. And that was actually a good experience, since you had to think in terms of encapsulation anyhow.

The presentation here is graphic, but the concept is the same as you mentioned:

LIST to show the program segments or whatever.

EDIT ... to edit a segment in the Beefy Editor Tokenizer Thing.

BruceMcF · Post by **BruceMcF** » Sat Apr 03, 2021 3:07 am

On 4/3/2021 at 1:39 AM, Scott Robison said:

That's essentially what I had in mind. Tokens would either be constants, variables, or executable (operators, functions, statements). I was thinking it would be stack based so that the statement actually is the last token so multiple statements are easily "crunched" into one sequence of tokens without a need to have extra delimiters / markers.

This is stack based ... it's just a slightly different style of converting infix notation to postfix execution, which makes it easier for the editor to line up the text of the line with the crunched representation. All disclaimers apply, not arguing its the perfect solution, yadda, yadda, yadda. It's similar to some systems written in Forth for executing infix arithmetic, and what the cruncher is doing is rearranging them to get the precedence correct when the operation stack is executed. Then ")" is the same as end of line, "$FF" above, while "(" is the "execute tokens" routine, so maybe if you had:

10 PRINT 100+(300*200)/(10*5)

You get the crunched line:

99 (PRINT token, formatting not needed in runnable code)

01 64 (literal byte value 100)

AA (add token)

FE (execute tokens token)

02 2C 01 (literal word value 300)

AC (multiply token)

01 C8 (literal byte value 200)

FF (execute operations token)

AD (divide token)

FE (execute tokens token)

01 0A (literal byte value 10)

AC (multiply token)

01 05 (literal byte value 5)

FF (execute operations token)

FF (execute operations token)

Maybe have a JMP (table,X) executor, with 256 codes available if you have two pages of jump vectors:

NextToken: INY

FirstToken: LDA (line),Y : ASL : TAX : BCS + : JMP (table1,X) : + JMP (table2,X)

If the operations stack is the hardware stack, then return address vectors are pushed for the tokens and "execute operations" is just RTS, with most operations also ending in an RTS ... with exceptions like the "interpret next line" routine that lives on the bottom of the operations stack and "execute tokens", where the token action is to push the address of NextToken-1, and of course branching routines.