New community dev tool uploaded: BASIC PREPROCESSOR

Scott Robison · Post by **Scott Robison** » Wed Apr 21, 2021 8:26 am

BASIC PREPROCESSOR

BASIC PREPROCESSOR allows one to create Commodore BASIC programs with a normal text editor without line numbers. Features:

Much as strings begin and end with a quotation mark ("), macro constructs begin and end with a commercial at sign (@). This means that you cannot include @ in a macro, but otherwise any character may be used.

A label can be defined on a line by itself as @+LABEL@.

A label can be referenced after a GOTO or GOSUB as @-LABEL@ (including ON statements).

A long variable name can be used as @!NAME@.

A preprocessed comment can be used as @' whatever text you want @. These comments are not written to the PRG file.

Any leading whitespace on a line is removed before writing the code to the PRG file.

The preprocessor (probably) requires an emulator built from the master branch on GitHub.

The program is written almost completely in BASIC. The one exception has to do with tokenization. Normally as you enter lines of BASIC the computer will translate them into a compressed tokenized form, and this is necessary for the programs to be usable. In order for BPP.PRG to create tokenized BASIC programs, it has a small machine language routine in golden RAM that converts from plain text to tokenized form. The tokenized form is written to the output PRG file.
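To make "tokenized form" concrete, here is a minimal Python sketch of the idea. It is not the machine language routine in BPP.PRG and not the ROM's crunch routine. The `crunch` name and the tiny `TOKENS` table are assumptions for illustration; the table holds only a handful of Commodore BASIC V2 keyword tokens, and the sketch ignores operators, REM, DATA, and keyword abbreviations.

```python
# A few Commodore BASIC V2 keyword tokens; the real table is much larger.
TOKENS = {"END": 0x80, "GOTO": 0x89, "IF": 0x8B, "GOSUB": 0x8D,
          "RETURN": 0x8E, "PRINT": 0x99, "THEN": 0xA7}

def crunch(line):
    """Tokenize one line of BASIC text: keywords outside quotes become one byte."""
    out, i, in_string = bytearray(), 0, False
    keywords = sorted(TOKENS, key=len, reverse=True)  # prefer the longest match
    while i < len(line):
        if line[i] == '"':
            in_string = not in_string
        if not in_string:
            word = next((k for k in keywords if line.startswith(k, i)), None)
            if word:
                out.append(TOKENS[word])
                i += len(word)
                continue
        out.append(ord(line[i]))  # everything else is copied through unchanged
        i += 1
    return bytes(out)
```

Text inside quotation marks is copied byte for byte, which is why a keyword inside a string survives as plain text.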

Here is a super simple example called SIMPLE.BPP.

Quote

@' THIS IS A COMMENT '@

@'

THIS IS

A MULTILINE

COMMENT

'@

@+LOOP@

IF @!COUNT-VAR@ > 10 THEN PRINT "DONE COUNTING": END

GOSUB @-INC-VAR@

GOTO @-LOOP@

@+INC-VAR@

@!COUNT-VAR@ = @!COUNT-VAR@ + 1

RETURN

An animated GIF demonstrates the process of using the program.

Submitter

Scott Robison

Submitted

04/21/21

Category

Dev Tools

mobluse · Post by **mobluse** » Wed Apr 21, 2021 3:47 pm

I think this is great. Does it handle labels in ON...GOTO and ON...GOSUB?

How do you enter the BPP files, since the built-in editor requires line numbers?

Elektron72 · Post by **Elektron72** » Wed Apr 21, 2021 3:51 pm

2 minutes ago, mobluse said:

How do you enter the BPP files, since the built-in editor requires line numbers?

I assume you would use a program like X16 Edit that can edit text files without line numbers.

Scott Robison · Post by **Scott Robison** » Wed Apr 21, 2021 4:52 pm

1 hour ago, mobluse said:

I think this is great. Does it handle labels in ON...GOTO and ON...GOSUB?

How do you enter the BPP files, since the built-in editor requires line numbers?

It should, though I've honestly not tried it yet. Just the simplest stuff. Let me check! {time passes} Yes!

The program works by looking for @ symbols. Just as a normal string is delimited with quotation marks, labels are delimited with @ symbols in my preprocessor. So when it encounters a label outside a string, it replaces it with the line number (or something approximating the line number).
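For anyone curious how that kind of substitution can work, here is a rough two-pass sketch in Python. It is not the code in BPP.PRG. The function names, the numbering in steps of 10, and the rule that a label points at the next emitted line are all assumptions made for the illustration; comments and long variable names are left out.

```python
import re

LABEL_DEF = re.compile(r"^@\+([^@]+)@$")  # a label definition on a line by itself

def number_lines(source, start=10, step=10):
    """Pass 1: number each code line and record which line each label points at."""
    labels, numbered, pending = {}, [], []
    number = start
    for raw in source.splitlines():
        line = raw.strip()              # leading whitespace is dropped
        if not line:
            continue
        match = LABEL_DEF.match(line)
        if match:
            pending.append(match.group(1))
            continue
        for name in pending:            # assumption: label = next emitted line
            labels[name] = number
        pending = []
        numbered.append((number, line))
        number += step
    if pending:
        raise ValueError("label with no following line: " + pending[0])
    return labels, numbered

def substitute(line, labels):
    """Pass 2: replace @-NAME@ references that occur outside string literals."""
    out, i, in_string = [], 0, False
    while i < len(line):
        ch = line[i]
        if ch == '"':
            in_string = not in_string
        if ch == "@" and not in_string and line.startswith("@-", i):
            end = line.index("@", i + 2)
            out.append(str(labels[line[i + 2:end]]))
            i = end + 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)

def preprocess(source):
    labels, numbered = number_lines(source)
    return [f"{n} {substitute(text, labels)}" for n, text in numbered]
```

Two passes are needed because a GOSUB can reference a label that is only defined further down the file.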

As for how to enter a BPP file, yes, you'd need a text editor like X16 Edit or some such. This is the one place I cheated. X16 Edit isn't working for me (I suspect because I'm running a bleeding-edge emulator I built after cloning the GitHub repository). So I used a text editor on my Windows machine, saved the file into an SD card image, and then ran it from there.

So this is probably not ready for prime time if one wants to use it in r38. It is just barely ready if you are using a bleeding-edge emulator. Or at least it is for me.

Scott Robison · Post by **Scott Robison** » Wed Apr 21, 2021 5:05 pm

Obviously there are better ways to do this, but back when I had my C=64, I didn't have the benefit of an extra computer on which to do dev work. I'm already cheating a little bit, as I admitted above, by using an external text editor, but I could have created the text file without it, so I'm allowing it. Text editing is the only external task I'm allowing myself thus far.

So my first substantive program is BASIC PREPROCESSOR. It is pure BASIC plus one ML routine, which I wrote entirely with the MONITOR in the emulator and then copied the bytes to DATA statements.

My second program will be BASIC EDIT. Not a competitor to x16 edit, but something very simple that can edit small text files. I will write it in BASIC PREPROCESSOR syntax so it will serve as the first "big" example of BASIC PREPROCESSOR. My intent is to write the smallest possible editor I can that allows me to add text, remove text, save files, and load files. Once I have that done I will try to do all my dev "natively" in the emulator (which is a contradiction in terms, but it is hopefully clear enough in context).

I do this not because it is the "best way" ... just because if I'm going to go retro, let's go retro!

Stefan · Post by **Stefan** » Thu Apr 22, 2021 5:29 am

I think this is an interesting solution to the shortcomings of the built-in BASIC.

I haven't had time to try it out yet, but I will.

As to the source code file format of any programming language that is developed for the X16 - be it BASIC, FORTH or assembly - it would be great if we used a common standard so that the source code may be edited in any present or future editor available on the platform.

The plain text PETSCII or ASCII file is the reasonable solution in my mind.

Plain text file formats are, however, not exactly the same on different computer platforms. This is most evident when thinking about line break encoding. We have at least the LF (ASCII 10) used in today's Linux/macOS, the CR (ASCII 13) used by legacy Mac OS, and of course the CRLF (ASCII 13+10) used by Windows.

Commodore 8-bit computers did not recognize ASCII control character 10. Even though there were a lot of custom solutions, the closest we have to a standard for line break encoding on Commodore machines is a single CR. That is also used by the VolksForth compiler that is available to us.

Scott Robison · Post by **Scott Robison** » Thu Apr 22, 2021 11:49 am

I agree that a standard would be good. If you download my plain text BPP.BAS file and look at it, you'll see it is "ASCII/PETSCII" compatible with the one exception that it uses CRLF line endings because I edited it on Windows. My solution to that problem, because line ending wasn't a huge consideration for me, is that I consider CR *or* LF to be a line ending character. So when preprocessing BASIC text that originated on Windows, I wind up with line numbers 0, 2, 4, 6, etc. Had I used just CR or just LF, it would have used every line number (0, 1, 2, 3, etc.), except for empty lines, which I do not emit.
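The numbering behavior is easy to show in a few lines of Python. This is only a sketch of the rule as described here, not the BASIC code from BPP; the function name `split_and_number` is made up for the example.

```python
def split_and_number(data):
    """Treat every CR or LF as a line terminator and number lines from 0.

    Empty lines are counted but not emitted, so the empty "line" between the
    CR and the LF of a CRLF pair uses up a line number.
    """
    result, current, number = [], [], 0
    for ch in data:
        if ch in "\r\n":
            if current:
                result.append((number, "".join(current)))
            current = []
            number += 1
        else:
            current.append(ch)
    if current:  # final line without a terminator
        result.append((number, "".join(current)))
    return result
```

Feeding it `"A\r\nB\r\nC\r\n"` yields the lines numbered 0, 2, 4, while `"A\rB\rC\r"` yields 0, 1, 2.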

Because I've not tested it extensively, I am not sure what it would do with some corner cases such as "non-empty line that only has space characters in it". I think, because of the way I leverage the BASIC crunch routine to tokenize the line, it might skip that, but it wasn't important to my proof of concept so I didn't dig deeper, especially since I think the program is dependent on the bleeding-edge GitHub build.

Scott Robison · Post by **Scott Robison** » Thu Apr 22, 2021 12:31 pm

7 hours ago, Stefan said:

Plain text file formats are, however, not exactly the same on different computer platforms. This is most evident when thinking about line break encoding. We have at least the LF (ASCII 10) used in today's Linux/macOS, the CR (ASCII 13) used by legacy Mac OS, and of course the CRLF (ASCII 13+10) used by Windows.

Commodore 8-bit computers did not recognize ASCII control character 10. Even though there were a lot of custom solutions, the closest we have to a standard for line break encoding on Commodore machines is a single CR. That is also used by the VolksForth compiler that is available to us.

Given that the platform is embracing more characters than the 8-bit machines of old, I think the right "text file source code EOL standard" should be "one or more characters in the set CR or LF". So it could handle plain old 8-bit CR-terminated lines, Unix-y LF-terminated lines, and DOS/Windows CRLF lines. Then tokenizers / parsers could easily skip blank lines as meaningless (unless of course someone decided that a blank line should be a syntactic construct, in which case they'd want to be more judicious).
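As a quick illustration of that rule, a Python sketch (the `split_lines` name is just for this example) can treat any run of CR and LF characters as a single line break:

```python
import re

def split_lines(text):
    """Split on runs of one or more CR/LF characters, dropping blank lines."""
    return [line for line in re.split(r"[\r\n]+", text) if line]
```

The trade-off is the one mentioned above: blank lines vanish, so a language that gives them meaning would need a stricter splitter.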

As for ASCII vs PETSCII, it would be nice if there was some sort of a BOM character like exists for Unicode that could be used as the first character in a file to identify the encoding.

For those who do not know (I'm not trying to talk down to anyone, we just all approach this with different backgrounds), original Unicode was a strictly two-byte-per-character encoding. There was no UTF-8. The problem presents itself: Are my characters in little-endian or big-endian order? U+FEFF was defined as a "Zero Width No-Break Space" character, which means it is just white space, so easily ignored by most language processing software. U+FFFE (the reversed form of U+FEFF) was defined at some point as a "noncharacter" that should not appear in Unicode text. So U+FEFF became the simple way to determine which byte order was in use.
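In code, that check amounts to looking at the first two bytes. A minimal Python sketch, covering 16-bit text only and using a made-up function name:

```python
def detect_byte_order(data):
    """Return 'big', 'little', or None based on a leading U+FEFF byte order mark."""
    if data[:2] == b"\xfe\xff":
        return "big"      # bytes arrive as FE FF, so the high byte comes first
    if data[:2] == b"\xff\xfe":
        return "little"   # bytes arrive as FF FE, so the low byte comes first
    return None           # no mark: the reader has to guess or be told
```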

With PETSCII vs ASCII, we don't have the byte ordering issue, but sniffing the encoding would still be useful. According to https://www.pagetable.com/c64ref/charset/ we have several flavors of SPACE in PETSCII:

$20: Normal Space Character (SP in either ASCII or PETSCII)

$A0: No-Break Space (NBSP in either ISO-8859-15 or PETSCII, the two native encodings on x16)

$E0: No-Break Space (NBSP in PETSCII but a-grave in 8859-15)

None of those are particularly useful for differentiating between ASCII vs PETSCII.

Another solution is what many editors support, which is to include a magic comment as the first line of source code that encodes metadata about the file. I think this is our best bet. In BASIC source code like my BPP.BAS file, I could include a first line like:

REM ENC=PETSCII EOL=CR

This would signal to the compiler that my file is in PETSCII encoding and uses CR as the end-of-line marker. In C one might create a line like:

/* ENC=8859-15 EOL=LF */

In ASM code maybe:

; ENC=ASCII EOL=CRLF

And so on. I would suggest that the "de facto" standard for x16 source:

1. Looks at the beginning sequence of characters up to the first CR or LF character.

2. The characters should be unshifted alphabetic characters so that uppercase ASCII and uppercase PETSCII (in graphics charset) map to the same set of character codes $41 - $5A. If in mixed-case PETSCII, it would be lowercase letters.

3. Valid encodings that should be recognized by all x16 compatible software should be ENC=PETSCII, ENC=ASCII, ENC=8859-15.

4. Valid end-of-line types that should be recognized by all x16 compatible software should be EOL=CR, EOL=LF, EOL=CRLF.

5. The valid character set of these NAME=VALUE pairs should be limited to alphabet (codes $41-$5A), digits, equal sign and hyphen with spaces appearing before and after each.

6. This allows for easy extension to include new attributes we might not consider now that would be generally useful, or for individual software to define their own custom NAME=VALUE pairs for their own use.
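A rough Python sketch of a reader for such a first line follows. It is one possible interpretation of the rules above, not a defined standard: the function name is hypothetical, and it accepts end of line as well as a space after a pair, since the REM example has no trailing space.

```python
import re

# NAME=VALUE built from $41-$5A letters, digits and hyphen; a space must come
# before each pair, and a space or the end of the line must come after it.
PAIR = re.compile(rb"(?<= )([A-Z0-9-]+)=([A-Z0-9-]+)(?= |$)")

def read_magic_comment(data):
    """Return the NAME=VALUE pairs found before the first CR or LF."""
    first_line = re.split(rb"[\r\n]", data, maxsplit=1)[0]
    return {name.decode("ascii"): value.decode("ascii")
            for name, value in PAIR.findall(first_line)}
```

It works on raw bytes because the encoding is not known until the line has been read. An ordinary first line such as `LET A=5` would also yield a pair, so a caller should only act on names it recognizes (ENC, EOL).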

These are just stream-of-consciousness ideas that do not obligate anyone to define a rigidly enforced standard. But they could be useful.

Stefan · Post by **Stefan** » Thu Apr 22, 2021 2:01 pm

Yes, it's not necessarily easy to get this right.

I think a magic comment would work fine for your BASIC preprocessor. That is, if you decide to support different options.

I'm sure there are valid historical reasons for the double byte CRLF on Windows, but it's hard to see the benefit of that encoding today. It only makes parsing the file more complicated in my opinion.

Scott Robison · Post by **Scott Robison** » Thu Apr 22, 2021 6:54 pm

History. Mechanical printers, teletypes, and similar devices often required a CR to return the carriage (not necessarily the print head; on an ancient typewriter it was the paper that moved), then an LF to advance the paper one line. Windows does it because DOS did it. DOS did it because CP/M did it. CP/M wasn't about translating control codes from an abstract set of commands to a device-specific set of commands, which is why many printers of the day had jumpers that you could set to customize how they would respond to things like ESC sequences, CR, LF, and so on.

I actually still use stand alone CR frequently when writing console mode / terminal mode programs to continuously update the same line with updated status information in a long running process.
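For anyone who hasn't seen the trick, a small Python sketch (names and bar layout are only for illustration) looks like this:

```python
import sys
import time

def status_line(done, total, width=20):
    """Build a status string that starts with CR so it overwrites the current line."""
    filled = width * done // total
    return "\r[" + "#" * filled + "." * (width - filled) + f"] {done}/{total}"

if __name__ == "__main__":
    for step in range(11):
        sys.stdout.write(status_line(step, 10))
        sys.stdout.flush()       # no newline is sent, so flush explicitly
        time.sleep(0.1)
    sys.stdout.write("\n")       # move to a fresh line once the work is done
```

Each update returns the cursor to column 0 without advancing a line, so the terminal redraws the same row.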