7 hours ago, Stefan said:
Plain text file formats are, however, not exactly the same on different computer platforms. This is most evident when thinking about line break encoding. We have at least the LF (ASCII 10) used in today's Linux/MacOS, the CR (ASCII 13) used by legacy MacOS, and of course the CRLF (ASCII 13+10) used by Windows.
Commodore 8 bit computers did not recognize ASCII control character 10. Even though there were a lot of custom solutions, the closest we have to a standard for line break encoding on Commodore machines is a single CR. That is also used by the VolksForth compiler that is available to us.
Given that the platform is embracing more character sets than the 8-bit machines of old, I think the right "text file source code EOL standard" should be "one or more characters in the set CR or LF". So it could handle plain old 8-bit CR terminated lines, Unix-y LF terminated lines, and DOS/Windows CRLF lines. Then tokenizers / parsers could easily skip blank lines as meaningless (unless of course someone decided that a blank line should be a syntactic construct, in which case they'd want to be more judicious).
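To make the "one or more characters in the set CR or LF" rule concrete, here is a minimal sketch in Python. The function name split_lines is my own invention, and it works on raw bytes because the encoding is not known at this point. It assumes blank lines carry no meaning, as discussed above.

```python
import re

# Any run of CR ($0D) and/or LF ($0A) bytes counts as a single line break.
_EOL_RUN = re.compile(rb"[\r\n]+")

def split_lines(data: bytes) -> list:
    """Split raw file bytes into non-empty lines on runs of CR/LF.

    Hypothetical helper: blank lines are dropped because a run of
    EOL characters is treated as one separator.
    """
    return [line for line in _EOL_RUN.split(data) if line]
```

A file that mixes CR, CRLF and LF endings would come out as the same list of lines regardless of which platform wrote it.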
As for ASCII vs PETSCII, it would be nice if there were some sort of BOM character, like the one that exists for Unicode, that could be used as the first character in a file to identify the encoding.
For those who do not know (I'm not trying to talk down to anyone, we just all approach this with different backgrounds), original Unicode was a strictly two byte per character encoding. There was no UTF-8. The problem presents itself: Are my characters in little endian or big endian order? U+FEFF was defined as a "Zero Width No-Break Space" character which means it is just white space, so easily ignored by most language processing software. U+FFFE (the reversed form of U+FEFF) was defined at some point as a "noncharacter" that should not appear in Unicode text. So a leading U+FEFF became the simple way to determine which byte order was in use.
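As a rough illustration of that byte order check, here is a small Python sketch. The function name utf16_byte_order is hypothetical, and it only considers two-byte Unicode; it does not try to distinguish UTF-32, whose little endian mark starts with the same two bytes.

```python
def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from a leading U+FEFF.

    Returns "big", "little", or "unknown" (no mark present).
    Assumption: the input is 16-bit Unicode, not UTF-32.
    """
    if data[:2] == b"\xfe\xff":   # U+FEFF stored high byte first
        return "big"
    if data[:2] == b"\xff\xfe":   # U+FEFF stored low byte first
        return "little"
    return "unknown"
```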
With PETSCII vs ASCII, we don't have the byte ordering issue, but sniffing the encoding would still be useful. According to
https://www.pagetable.com/c64ref/charset/ we have several flavors of SPACE in PETSCII:
$20: Normal Space Character (SP in either ASCII or PETSCII)
$A0: No-Break Space (NBSP in either ISO-8859-15 or PETSCII, the two native encodings on x16)
$E0: No-Break Space (NBSP in PETSCII but a-grave in 8859-15)
None of those are particularly useful for differentiating between ASCII vs PETSCII.
Another solution is what many editors support, which is to include a magic comment as the first line of source code that encodes metadata about the file. I think this is our best bet. In BASIC source code like my BPP.BAS file, I could include a first line like:
REM ENC=PETSCII EOL=CR
This would signal to the compiler that my file is in PETSCII encoding and uses CR as the end-of-line marker. In C one might create a line like:
/* ENC=8859-15 EOL=LF */
In ASM code maybe:
; ENC=ASCII EOL=CRLF
And so on. I would suggest that the "de facto" standard for x16 source:
1. Looks at the beginning sequence of characters up to the first CR or LF character.
2. The characters should be unshifted alphabetic characters so that uppercase ASCII and uppercase PETSCII (in graphics charset) map to the same set of character codes $41 - $5A. In mixed-case PETSCII, those same codes would display as lowercase letters.
3. Valid encodings that should be recognized by all x16 compatible software should be ENC=PETSCII, ENC=ASCII, ENC=8859-15.
4. Valid end-of-line types that should be recognized by all x16 compatible software should be EOL=CR, EOL=LF, EOL=CRLF.
5. The valid character set of these NAME=VALUE pairs should be limited to letters (codes $41-$5A), digits, the equal sign and the hyphen, with spaces appearing before and after each pair.
6. This allows for easy extension to include new attributes we might not consider now that would be generally useful, or for individual software to define their own custom NAME=VALUE pairs for their own use.
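The rules above could be sketched roughly as follows in Python. This is only my reading of the proposal, not a reference implementation: the name parse_header is made up, the end of the first line is treated as a valid delimiter for the last pair, and no check is made that ENC or EOL hold one of the recognized values.

```python
import re

# One NAME=VALUE token: letters $41-$5A, digits and hyphen only (rule 5).
_PAIR = re.compile(rb"([A-Z][A-Z0-9-]*)=([A-Z0-9-]+)")

def parse_header(data: bytes) -> dict:
    """Collect NAME=VALUE pairs from the first line of a source file.

    Hypothetical helper. Works on raw bytes so the same code handles
    ASCII and PETSCII, which share codes $41-$5A for unshifted letters.
    """
    # Rule 1: only look at bytes before the first CR or LF.
    first_line = re.split(rb"[\r\n]", data, maxsplit=1)[0]
    pairs = {}
    # Rule 5: pairs are delimited by spaces; other tokens are ignored.
    for token in first_line.split(b" "):
        match = _PAIR.fullmatch(token)
        if match:
            name, value = match.groups()
            pairs[name.decode("ascii")] = value.decode("ascii")
    return pairs
```

Unknown names are simply returned alongside ENC and EOL, which leaves room for the custom attributes mentioned in point 6. A tool would still have to decide what to do when the first line is ordinary code that happens to contain something like X=1.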
These are just stream-of-consciousness ideas that do not obligate anyone to define a rigidly enforced standard. But they could be useful.