BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...
Well, yesterday was longer than I expected. Sheesh. I don't like medical stuff, but it's especially annoying if you have a thing that gets bumped several hours at a time because the docs are all in the ER working on a motor vehicle accident. Oh and "no, you still cannot eat or drink anything but water, sir." Geez, just cancel my thing and let's reschedule, but don't keep me in there extra hours just waiting. Ah, well, I guess it's medicine in a medium-sized city in 'current year' USA. I'm very sorry I didn't get back until late, and then had errands and other things today, so I had to delay writing this up.
Before I start... Scott: Awesome work on some benchmarking for variable cost/overhead. One possible addition would be to add a line (before everything else) where you initialize a bunch of dummy scalar variables (DA,DB,DC,DD,DE,DF,DG,DH, etc). You can use 'DIM' to initialize regular/scalar variables -- just call the command with the variables separated by commas, e.g., 'DIM DA,DB,DC,DD,DE' -- which should let you fit 20+ dummy variables in one line. You can REM it out to duplicate the results you already have, and then un-REM the line and see how scalar variable performance changes when the ones in your benchmarking routines effectively become later-initialized scalars and have to do the 'nope, nope' stuff in view of all the others earlier in pecking order....
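That 'pecking order' cost comes from how CBM-family BASICs store scalars: one list in creation order, scanned from the front on every reference. Here's a toy Python model of that behavior (the list-of-tuples layout is a simplification for illustration, not the real in-memory record format):

```python
# Toy model of CBM BASIC's scalar variable table: entries live in a
# linear list in creation order, scanned front-to-back on every reference.
def lookup(table, name):
    """Return (comparisons_made, value) for a scalar variable lookup."""
    for idx, (n, v) in enumerate(table):
        if n == name:
            return idx + 1, v          # had to test idx+1 entries to find it
    raise KeyError(name)

# DIM DA,DB,DC,... creates the dummies first, so they sit at the front...
table = [("D" + c, 0) for c in "ABCDEFGHIJKLMNOPQRST"]   # 20 dummies
table.append(("X", 42))     # ...and the benchmark's own variable lands last

cost, value = lookup(table, "X")
print(cost, value)          # X now pays 21 comparisons on every reference
```

A benchmark loop that references 'X' thousands of times pays that scan cost on every single reference, which is why creation order shows up so clearly in the timings.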
Anyway, now to the thought process and the optimization that got the time of the 'Proteus' demo down to 3 minutes and 47 seconds....
As I said above, the key was looking at the "precursor" expression that creates the value that goes into the COS() functions in the cosines stack (with the result of all that, in turn, getting tweaked into the 'y' pixel coordinate).
EDITED: Yes, I know what I call the 'precursor' is the 'angle' that gets fed to the cosine functions. But it's not an angle, not really, because if the original author wanted to convert degrees to radians (s)he ought to have multiplied it by pi/180 (approx. 0.01745), and the coefficient used was 0.0327, which is not an even multiple of that or anything. It's just a magic number the original author fiddled with until the output matched what was wanted. So call it an angle if you wish, but it's just the 'precursor' to me! At any rate...
In the "0652" time version of the demo above, that precursor was at the beginning of line 4:
Y=SQR(J+I*I)*D
We obfuscated it a bit in prior optimizations by kicking part of the calculation up into the outer loop (J=O*O), but in essence the expression generates the square-root of the sum of two squares and then multiplies the result by a magic number the original author came up with to get everything to work on the screen the way (s)he wanted.
The things that are squared are the values of our two loop indexing variables (outer 'O' and inner 'I').
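In Python terms, the precursor boils down to this (0.0327 is the scaling coefficient mentioned in the edit above; the hoisted J=O*O from the prior optimization is shown inline):

```python
import math

D = 0.0327   # the original author's magic scaling coefficient

def precursor(o, i):
    # Y = SQR(J + I*I) * D, with J = O*O hoisted into the outer loop
    j = o * o
    return math.sqrt(j + i * i) * D

# The sign of either input is immaterial: only the squares participate.
print(precursor(3.0, 4.0) == precursor(-3.0, -4.0))   # True
```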
My brainwave was to look at this and think about a basic mathematical identity: N squared is equal to (-N) squared. A positive times a positive is a positive; a negative times a negative is also a positive....
This is helpful because each of the main loop indexing variables swings between negative and positive. The outer loop runs from -144 to 144 with a step of 2.25. So there are 129 different values of 'O' as it iterates. Digging deeper, that step increment is an even divisor into 144, which means that the indexing variable will have 64 negative values, the 0 value, and 64 positive values that are the SAME (save for the change in sign) as the 64 negative values. But what that means is that the outer loop variable effectively makes only 65 possible different contributions to that square-root of the sum of two squares expression. Whether 'O' is 144 or -144; -141.75 or 141.75, the value of 'O' participates in the 'precursor' expression only by being multiplied by itself. The sign of 'O' is effectively rendered immaterial for THAT part of the calculations in this program.
Likewise, the inner loop runs from -144 to some varied positive number calculated based on the value of the outer loop indexing variable, with an increment step of 1. The way the endpoint is calculated, it cannot be higher than +144. Once again, that means that (keeping in mind the loop also swings through the 0 value while iterating), although the inner loop runs up to 289 iterations, the inner loop indexing variable can only supply 145 different contributions to that 'square root of the sum of two squares' expression... The sign of the value of 'I' is immaterial to the precursor-to-cosines stack combo, since (-I) * (-I) is the same as I * I.
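A quick sanity check of those distinct-contribution counts (this just models the two FOR ranges in Python; it is not the demo itself):

```python
# Outer loop: FOR O = -144 TO 144 STEP 2.25  -> 129 iterations
o_values = [-144 + 2.25 * k for k in range(129)]
# Inner loop at its widest possible sweep: -144 TO 144 STEP 1
i_values = range(-144, 145)

# Since (-N)*(-N) = N*N, only the squares matter to the precursor:
distinct_o = {o * o for o in o_values}
distinct_i = {i * i for i in i_values}
print(len(o_values), len(distinct_o), len(distinct_i))   # 129 65 145
```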
Synthesizing all this, we start to see something: Our inner loop has been calculating the extremely expensive combination of what I've been calling the 'precursor' and that triple cosines stack (and a couple multiplication operations into the bargain) to get the 'y' coordinate just under 32,000 times over the course of running the program -- i.e., whenever the more cheaply calculated 'x' coordinate lands within the valid coordinates of the bitmap screen. HOWEVER, our examination above has just demonstrated that the calculations in the "evaluate precursor, feed it to the cosine stack" sequence are always going to be performed on what can effectively be treated as a maximum of 9425 (65 x 145) combinations of inputs.
It means we might be able to make a lookup table! A 65x145 table.
But whether we can implement such a table and what we put in it is limited by memory constraints.
For example, if it turns out that we MUST store floating point values for this to work, we cannot use a regular array and this idea probably dies. A float array takes 5 bytes per entry. So, um, yikes! That's 9425 * 5 = 47,125 bytes, which is way WAY more than the 39K of memory available to BASIC. I suppose we could make an artificial fixed point array using POKEs/PEEKs, storing two bytes in memory for each value, one for the integer portion and one for the fractional portion. But frankly, it seems to me using BASIC to calculate, store and fetch both pieces might be too costly, since all our 'overhead' would have to be done in BASIC. (It does give me an idea for a 'USR()' routine for some future project, however.)
I didn't bother trying that, because it turns out we can store our 9425 values in an integer array. In C64 style BASIC, that means we have a range of -32768 to 32767 for each value, and it takes only 2 bytes of memory per entry. That's less than 20K of memory cost here. We can do that.
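The memory arithmetic, spelled out (the 39K of BASIC RAM and the 5-byte/2-byte element sizes are as stated above):

```python
entries = 65 * 145            # one table slot per (|O|, |I|) combination
basic_ram = 39 * 1024         # roughly what's available to BASIC

float_cost = entries * 5      # CBM BASIC floats: 5 bytes per array element
int_cost   = entries * 2      # CBM BASIC integers: 2 bytes per array element

print(entries, float_cost, int_cost)    # 9425 47125 18850
print(float_cost > basic_ram)           # True  -- a float array can't fit
print(int_cost < basic_ram)             # True  -- an integer array can
```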
But... does it actually help us if we can only store integers? Let's reason it out. Looking at the plotting sequence as a whole, it is apparent that the 'y' pixel coordinate is calculated and left (as one of my updates above mentions) as a float and fed to the graphics commands as-is, and the commands simply disregard the fractional portion.
So can we throw out the fractional portion earlier?
Let's see. The final 'y' coordinate for a pixel location is evaluated as:
Y=G - ({n} - T + F)
{n} is the result of that 'precursor' and cosines stack. 'F' is a constant that acts like the "vertical position" knob on an old analog TV -- changing it moves the entire image up or down within the bitmap. So far, I've kept it at the original author's suggested value of 90.
'T' is the outer loop index value 'O' divided by the step increment 'B' (constant of 2.25). Which is to say T ranges from -64 to +64. OH! LOOK! Now we see how, despite the sign of the outer loop indexing variable being discarded through calculation of the precursor expression, the sign (pos/neg) of 'O' nevertheless DOES matter and plays into the ultimate 'y' coordinate pixel generation, because 'T=O/B' gives a different result in terms of sign depending on whether 'O' is positive or negative. Nice.
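So while O's sign washes out of the precursor, it survives into the final coordinate through T. A sketch of that formula (constants B=2.25, F=90, G=199 as given in this thread; `n` stands in for the precursor/cosines result):

```python
B, F, G = 2.25, 90, 199

def y_coord(n, o):
    t = o / B                  # T keeps O's sign: ranges -64 .. +64
    return G - (n - t + F)

# Identical precursor contribution (O*O is the same), but O's sign
# still steers the pixel to a different row on screen:
print(y_coord(0.0, 144.0), y_coord(0.0, -144.0))   # 173.0 45.0
```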
That also tells us that what I refer to as {n} above (the result of the precursor and cosines stack) is the thing we should store in our table, if possible.
OK, so reasoning further: The result from the precursor/cosines stack (including the multiplication by the original author's constant 'C') is a float. It is a small float that we can be sure will not exceed BASIC's 'valid' integer variable range of -32768 to 32767. We know this because we can see that after the tweaks of subtracting 'T' (value in range -64 to 64) and adding 'F' (const. of 90), and then taking the result and subtracting it from 'G' (const. 199), it will, most of the time, yield a valid vertical pixel coordinate between 0 and 199.
That's great news. It means we can throw that {n} into our integer table. BUT. Doing so results in disregarding the fractional portion earlier than normal. Here's what that does and what we need to do to deal with it. Suppose that prior to using our table (i.e., in the earlier program versions) some {n} value, after getting tweaked by 'T' and 'F', winds up yielding a result of 19.378. That number then got subtracted from 199 to give a 'y' (vertical) pixel coordinate of 179.622. Feeding that to our graphics commands resulted in the fractional component being disregarded, meaning a pixel would be plotted at vertical position 179.
But, now suppose we stick the result of {n} into an integer array variable. It loses its fractional component right at that moment. Which means that same value after adjustment by 'T' and 'F' winds up returning '19' instead of 19.378. And subtracting 19 from 199 yields a vertical pixel coordinate of 180. It's off-by-one. That's the only problem. We need to subtract an additional '1' from 199 to get the 'correct' pixel coordinate due to the 'premature' removal of the fractional component when we use the table.
Remember that 'F' constant I just mentioned (the one that's like a 'vertical position' setting)? We will change it from 90 to 91, and that will do the trick.
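The fix in miniature (example numbers only; `math.floor` models CBM BASIC's float-to-integer conversion, which floors like INT does):

```python
import math

G = 199
n, t = -10.622, 60     # chosen so that n - t + 90 = 19.378, as in the text

# Old path: n stays a float; truncation happens only at plot time
y_float = math.floor(G - (n - t + 90))      # floor(179.622) -> 179

# New path: n was floored going into the integer table, so bump F to 91
n_table = math.floor(n)                     # floor(-10.622) -> -11
y_table = G - (n_table - t + 91)            # 199 - 20      -> 179

print(y_float, y_table)    # 179 179 -- same pixel row either way
```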
Well, that's it.
That's how it works. I'll put up a screen shot of the listing tomorrow. Also, I'm going to put up a combined 'final' program in the 'demos' section, probably with a little menu that lets you pick between three versions of this sucker. I'll try to include one of the very early versions from this thread (if I can find one I saved), the 1.0 version that was the best performer prior to this breakthrough, and then this version '2.0' update. Having all three in one program will hopefully help folks make a close study of the listing and see how things evolved.
[EDITED: Ok, the 3-in-1 demo is in 'demos' in the downloads section]
One final word: This latest optimization really improved performance, but only by eating literally HALF of all the memory available to BASIC. That's often a thing... sometimes you can gain a lot of speed by throwing memory at a problem. And for a free-standing demo like this, hey... whatever works. Many famous C64 demo scene machine language programs pre-calculate all sorts of stuff using every possible available byte of memory. But if this were just a small part of another program, it would clearly be a bridge too far. That said, given what the values in the table look like, I have an idea for an alternative implementation that might be able to get away with using only 10K of memory... but there's no way it would be this fast.
Cheers!
ETA: Here's that listing....