--continued---
In the first post, we picked a little BASIC bitmap graphics program to convert from the Plus/4 to the X16. In the second post, we accomplished the 'functional' adaptation.
Using turbo mode on the Plus/4 emulator, we found it took hours (2h, 42m+) to run the full plotting sequence. But even before any optimization, our newly minted X16 version got things done in 'just' 11 minutes and 4 seconds.
In the third post (just above), we did a first optimization pass with some 'easy' things (organization, shortening variables, simple efficiency tweaks). We also paid homage to the gods of the BASIC parsing engine in a number of ways. The resulting was we reduced the runtime down to 10 minutes and 6 seconds. Nearly a minute faster! We're well on the way to my aspiration of a time less than 8 minutes, 50 seconds.
Now we are going to turn to the more profitable optimizations. Although they're more effective when available, we do these
last because the changes involve shuffling things around and (probably) multi-purposing certain variables by treating them almost like registers. The resulting BASIC program will be more difficult to understand, and a lot harder to debug if its not working perfectly already.
Before we get started, lets implement a quality of life improvement. When you switch from the bitmap (mode $80) BACK to the 40 or 80 column text-only screen (modes $00 or $02), the current ROM doesn't re-initiate the layers to the background color and it's sort of unreadable. I'd guess that will change in some future update. For now, we can do what's necessary manually from the command line when we switch to 40 columns, by the combination:
'SCREEN$00:COLOR1,6:CLS'
Let's add that to our program listing. We'll want to pause the output before that switch, so let's put a line that will 'GET' the current keypress (we will stick it in a string variable called 'KEY$' -- which is really just 'KE$' since as Commodore BASIC only cares about the first two characters of a variable name), and use an 'IF/THEN' to force a re-run of this line until there is a key pressed. Then execution can fall through to our little clean up command (lets also 'LIST' the program there) once the user presses something.
.
9 GETKEY$:IF KEY$="" THEN9
10 SCREEN$0:COLOR1,6:CLS:LIST:END
I think I'll also add a bunch of down cursor commands in the quote where we 'PRINT' the time at the end so it appears near the bottom left like the Plus/4 version. Now, let's finally dig in to some more optimizations, continuing our fancy internet 'lettered list'.
E. Making it faster WHILE making it look better.
Our first optimization is for both speed and a visual improvement. Take a look at this portion of a screen shot of the program's output:
See that vertical banding/glitching? I've stuck in a bunch of gaudy, ugly, arrows to line up with the vertical irregularities in the pixel plots below them, so you can see what I'm talking about. That 'banding' is present in both our X16 adaptation and the original Plus/4 version. And its not due to flaws in the underlying curve creation calculations. (The Sine/Cosine functions from the C64, Plus/4 and C128 (and, thus, X16) BASIC are accurate to many more bits of precision than could possibly make any difference when translating the results to a relatively low-res plotting area of 320x200). No, those glitches are what happen when you re-scale in a very crude way: i.e., by starting with pixel locations that were based off directly sampling specific intervals along the underlying raw curve function, and multiplying them against a coefficient that doesn't divide evenly into the sample intervals.
Those of you who have ever tried to use scaling settings in DOSBOX to resize a game's display from the original MSDOS resolution to a bigger window -- but where the scaling factor isn't an even divisor into both the original and destination resolutions -- will have seen something similar.
The 'scaling down' multiplications here are the ones where the original author introduced variables 'X1' and 'Y1' which were assigned by multiplying '.85' and '.9', respectively against the 'X' and 'Y' variables previously generated directly from sampling intervals at points on the underlying functions. For several reasons, I believe this scaling was a last minute addition by the original author (or, more likely, the editors of the magazine where the program listing appeared), but that's besides the point. You're probably not reading this thread for speculative software anthropology.
Bottom line: (1) the vertical banding looks bad; and (2) the extra multiplications in the inner loop make things slower (they cause extra parsing and the introductions of variables 'X1' and 'Y1' are bound to result in serious performance losses -- maybe even in part due to variable management issues we'll talk at some point before I'm done with this excessive odyssey about a very short program ).
Let's remove those crude scalers. The relevant program lines go from this:
5 X=I+O/B+E:Y=S-O/B+F:X1=.85*X:Y1=.9*(G-Y):IFX1<.ORY1>GTHEN7
6 PSETX1,Y1,.:LINEX1,Y1+1,X1,G,1
to this:
5 X=I+O/B+E:Y=G-(S-O/B+F):IFX<.ORY>GTHEN7
6 PSETX,Y,.:LINEX,Y+1,X,G,1
As you can see, we just rolled the 'subtraction from variable G' into the original expression to calculate 'Y'; also, we had to change the 'PSET' and 'LINE' commands to pass them 'X' and 'Y' instead of passing the scaled 'X1' and 'Y1' variables, which we've eliminated.
There's a risk here: When we first got the program working, we did bounds checking only as to the left and bottom of the bitmap. If this was scaled down by the original author to keep it from spilling off the right side of the screen when plotted at full size, we'll get an error. Well, let's 'RUN' it and see:
Shazam! In my view, that looks WAY better. No vertical banding/glitching, and it fills the screen nicely. It came close but didn't spill off the right side so we didn't pay for our omitted right side bounds checking with an error.
And, wait a second, what's this? OH.MY.WOW.... Look at that time!!! 8 minutes, 40 seconds. I knew we would also be speeding things up, but this shaved a
huge amount off the elapsed time from the prior version! So much that we've already bested our goal of getting this sucker below 8 minutes and 50 seconds!
I guess we're done... right?
Just kidding. I've got more to do, and now I'm really curious (and, if I'm honest, tending toward unnecessarily manic) to see to just how far we can take this. Now, part of the speed up here is for obvious reasons: For each of the 33,175 iterations of our inner loop, we eliminated 24 byte parses; 4 variable fetches; and two multiplication operations. But I suspect the biggest part of the optimization was a consequence of that variable management issue I keep hinting at. Sorry, that's still for very last part of my write up, but I'll get there (eventually).
F. EVICTING calculations, parts of expressions, and slow things from the Inner Loop: I have to confess a math error. In the prior post I said the outer loop indexed by 'O' runs 128 times. I did the math in my head and overlooked the need to add 1 because 'FOR/NEXT' loops run from their starting point to ending point 'both inclusive.' If the interpreter adds the interval value to the current indexing variable and the result is less than OR equal to the ending value specified in the 'FOR' statement, it will loop again. So there are actually 129 outer loop iterations.
As you will recall, we did a temporary alteration of the code listing in a previous post to add a counter, and this helped us figure out that the inner loop runs 33,175 iterations over the entire run of the program. While not every iteration will plot to the screen (if coordinates fail our limited bounds checking, the program skips the line with the 'PSET' and 'LINE' commands), every other operation inside that inner loop must be parsed and the operations performed those same 33,175 times. If we do the math (129/33175)*100 we see the implication: The things happening in the inner loop represent about 99% of the impact in terms of speed for the work done by the program. That means if we can move stuff out of the inner loop to the outer loop (or even just kick them from the inner loop, period), we will create some real time savings. Let's do that now.
I'm going to walk in detail through the first two of these, and then more briefly describe the rest. Here's the program lines containing our current outer/inner loops so you can refer to it when reading what follows:
3 FORO=-ATOASTEPB:A%=.5+SQR(A*A-O*O)
4 FORI=-ATOA%:Q=SQR(I*I+O*O)*D:R=COS(Q)+COS(Q+Q)+COS(5*Q):S=R*C
5 X=I+O/B+E:Y=G-(S-O/B+F):IFX<.ORY>GTHEN7
6 PSETX,Y,.:LINEX,Y+1,X,G,1
7 NEXT:NEXT
i. Lets start with that 'O*O' calculation we see up there. Both our outer loop AND our inner loop calculate 'O squared' and we'll change both, but its in the inner loop where it will help. Remember 'O' is now the indexing variable for our outer loop, so the result of that calculation changes every time the outer loop iterates. But since the INNER loop runs nestled inside the outer loop, the value of 'O' for any full run of the inner loop (as part of any given iteration of the outer loop) remains the same. We know that 'O*O' is a discrete operation that can be plucked out because (a) its value does not depend on the inner loop variable (or some other variable that is altered earlier within the inner loop); and because (b) the hierarchy of evaluating expressions in BASIC puts multiplication/division at a higher priority (done first) before addition/subtraction, so that the 'O*O' expression can be taken out and replaced with a variable without altering the evaluation of the larger expression in which it is a part. Thus, there is NO REASON to perform 'O*O' repeatedly some 33,175 times in the inner loop! Here's what we're going to do:
Let's introduce a new variable 'J' and, immediately after the start of the outer loop, lets set it with 'J =O*O'. That calculation being located right there means every time the outer loop iterates and its indexing variable, 'O', changes, our 'J' value will get updated before proceeding to the rest of the outer loop and before commencing that full sequence run of the inner loop. So we can replace the 'O*O' calculations in both the outer AND inner loops with a reference to variable 'J'. For every single iteration of the inner loop (that's 33,175 times!) we have now replaced two variable fetches of variable 'O' with a single variable fetch of variable 'J'; and we have skipped that costly 'multiplication' operation.
ii. Same thing with the calculation 'O/B'. Division is even slower than multiplication. We already know 'O' doesn't change during the run of an inner loop. And we know from a few posts up that the variable called 'B' is a constant that was set at program initialization, and not altered anywhere else in the program. So TWICE within every iteration (all 33,175 of 'em) of the inner loop, we have been doing 'expensive' division operations that don't need to be in the inner loop. Every time it does that operation in a new iteration within the same run of the inner loop the result of that calculation is the same as the prior iteration! We'll evict these calculations by introducing another new variable, 'T' this time, and set it early within the outer loop as: 'T=O/B'. Then we replace the two 'O/B' operations in the inner loop with 'T'. Again, we can be confident its ok to do so because the hierarchy of operations means that, before our alteration, the 'O/B' division operations were evaluated at those locations prior to the interpreter proceeding with the neighboring addition/subtractions against the result. Therefore replacing that calculation with a variable fetch does not affect the evaluation of the larger expression.
Here's where we are in terms of updates so far to the lines containing our loops:
3 FORO=-ATOASTEPB:J=O*O: T=O/B: A%=.5+SQR(A*A-J)
4 FORI=-ATOA%:Q=SQR(J +I*I)*D:R=COS(Q)+COS(Q+Q)+COS(5*Q):S=R*C
5 X=I+T+E:Y=G-(S-T+F):IFX<.ORY>GTHEN7
6 LINEX,Y+1,X,G,1:PSETX,Y,.
7 NEXT:NEXT
So hopefully you get the gist of what we're doing here. I immediately see two more obvious -kick something out completely - evictions and I'll explain them only briefly before we try another 'RUN' to see how we're coming along:
iii. Swapping the order of the commands in the 'PSET' and 'LINE' part of the listing allows us to eliminate the need for a '+1' math operation that gets parsed and processed 33,175 times by the 'LINE' command in the inner loop. It works because the 'LINE' is plotting background color and the 'PSET' is plotting the foreground (black) pixel point. Only instead of starting the line (erasure) operation one pixel below ('Y+1') the current coordinates, we just perform the background line first starting right at the current 'Y' and then plot over that background pixel with the 'PSET' operation setting our new pixel after that. Same exact visual result, but we saved the parse of two bytes ('+' and '1') and an addition operation for every run of the inner loop that results in a pixel plot (i.e., most of the iterations) by doing it this way:
6 LINEX,Y,X,G,1:PSETX,Y,.
iv. After the program sets variable 'R' to the result of that stack of Cosine functions, It sets another variable 'S' to the resulting value held in 'R' times the value of variable 'C' -- a constant set during initiation of the program. That 'R' variable is holding an intermediate value; and 'R' is not used anywhere else in the program except to be multiplied by the constant 'C' to put a value into 'S'. By and large, you should not store intermediate values where it is not necessary to do so. It turns out that parsing one set of 'added' parentheses during expression evaluation is still faster than parsing the extra variable fetches and stores (and net bytes of additional character parses to perform the 'R=S*C' expression), which means we can simply simply set 'S' to be the result of that stack of Cosines AND the multiplication by 'C' all evaluated as a single expression, and without using variable 'R' to hold an intermediate value. We put the stack of Cosines in parentheses since that whole mess needs to be evaluated first, and the result multiplied by 'C' though. Still, bye bye variable 'R' -- and here's the revised line:
4 FORI=-ATOA%:Q=SQR(J +I*I)*D:S=C*(COS(Q)+COS(Q+Q)+COS(5*Q))
OK, so here's our current listing, which we should save:
Now, let's 'RUN' it and see where we are on the question of speed:
Not bad. 8 minutes, 2 seconds elapsed. Another good time savings, and by now its safe to say that we've literally blown the doors off our initial optimization goals. Lets move on. Its like they say in that dumb party game: 'How low can you go?'
G. Transforming a Branch Intended to Avoid Errors into a Time Savings: Lets start this particular optimization with an analogy that, believe it or not, is surprisingly useful in its parallels with our program:
Suppose you've decided to make Tacos for your grand-child, niece/nephew, (or whatever) who is visiting you for the weekend. You'll need to take the ground meat out; fry it up with taco seasonings; chop onions; shred lettuce; dice tomatoes; mince cilantro; grate cheese; heat taco shells; plate everything up; and sit down with the kiddo to partake in the feast. But there are a couple wrinkles: (i) You don't have any idea if the kiddo even LIKES tacos, but you know the youngster is a spoiled brat who will have a tantrum if (s)he doesn't like that type of food; and (ii) you have never actually prepared tacos, since you normally go for TV dinners or take-out when you don't have company -- so its unclear if your tacos will actually be tasty or even edible or not.
In our program, we KNOW that we also have two wrinkles within that inner loop. The way the set of curve progressions work, (i.e., in order for the image to develop correctly), the initial parts of several curves actually generate horizontal ('X') coordinates that are less than zero and, thus, off the left side of the bitmap. Also (having watched this plot out too many times, now) I can tell you that we also know that later during the run -- although not quite apparent from the final image that appears on screen -- the formulas do generate some vertical axis ('Y') coordinates that are greater than 199 and would be below the bottom of the screen (causing an error were it not for the fact our program does do bounds checking for whether 'Y' is greater than 199).
Lets call that boundary issue with the x axis the question of whether the kiddo likes tacos; and the boundary issue with the y axis the question of whether the food actually turns out to be any good once you're done cooking. This is an apt comparison if you look at our program listing carefully : The final 'X' coordinate is generated within the inner loop by just a couple of simple additions to the value of the indexing 'I' variable from that inner loop. That's like the easy question of saying: "Hey, kiddo, do you like home-made tacos?" Simple. But the final 'Y' coordinate requires (within every iteration of the inner loop mind you,) a big string of slow calculations: There's a square root ('SQR()') command, which is slow indeed; there's a bunch of multiplications, a stack of Cosines (each one being quite slow) and some additions/subtractions at the end, all involving a significant number of variable fetches. Sheesh. That is a LOT of work. But that's why deriving the 'Y' is like the cooking/meal prep process in my analogy. You don't know how THAT part of things turned out, (if the meal is any good or the 'Y' coordinate is on or off of the screen), until you've done all the hard work.
Thus far, we we have had our program test BOTH of those 'wrinkles' -- the 'X' and 'Y' boundary issues -- with a single 'IF/THEN' check that evaluates both conditions. And we do this ONLY right before the plotting. With the 'Y' coordinate, we don't have much of a choice: Just like the question of whether the chef cooked up a decent tasting taco plate, we only know what 'Y' looks like after doing all that work. But with the 'X' coordinate, we have been doing our test VERY late in the game. Its like getting all the food out, frying up the meat, cutting the veggies and getting everything plated up and only then asking the spoiled little brat if he/she even likes tacos. That's the worst possible time to ask that question, because you've already done all the work.
Hence this optimization. Let's re-arrange things. First, let's add yet another variable to the outer loop. This time, we'll take the part of the 'X' derivation from inside the loop and simplify it further. In the eviction process above, we already got rid of the 'O/B' and replaced it with variable 'T' in the inner loop. What we're going to do here is go one step further in the 'X' part of the inner loop calculations. We create variable U directly after setting 'T' in the outer loop, and we do so as follows: 'U=T+E' which has the effect of evicting variable 'E' out of our inner loop too because now 'U' substitutes for 'O/B+F' (previously, it has been summed up with the current value of the inner loop indexing variable 'I' and the result of that previously evicted 'O/B' calculation).
Now we can split up our heretofore conjoined bounds checking 'IF/THEN' statement. Immediately after the beginning of our inner loop, we'll do the (very simple/cheap) calculation to generate 'X' so we can immediately see if 'X' is off the screen. We know it happens some percentage of time during a full run. But I will confess here, that before taking the trouble to write up and test this part, I made it my business to KNOW the answer. ( I again made some temporary changes to the program listing to actually count this up). The answer is that during a full program sequence, the program skips plotting a pixel because the 'X' value is less than 0 a total of
1176 times, or about 3.5% of the 33,175 inner loop iterations.
What a waste! We have been 'cooking the entire meal' before checking this issue. We have been doing all those slow calculations to derive 'Y' when the 'X' value would prevent a pixel plot anyway. With the changes we just made, i.e., by evaluating 'X' at the outset of the inner loop, we know if it fails the bounds check we can jump immediately to the next iteration of the loop. If 'X' is bad, it doesn't matter what 'Y' is, which means this approach skips all those very slow calculations eventually used to derive 'Y', as it turns out, in about 3.5 percent of the inner loop iterations.
And that's the lesson of this subsection: Wherever possible, consider whether you can avoid doing costly/slow work before a branch when it is predictable that a material percentage of the time the branch will in fact occur and make that work moot.
OK, here's the resulting listing:
As you can see, if the test 'X' reveals the pixel is 'in bounds (i.e., no branch because of a failed bounds check of that easy to prepare 'X' value), program execution can then fall through to all those slow calculations used to derive 'Y.' Then we have to put in a second 'IF/THEN' statement to bounds check 'Y' and prevent plotting when it exceeds the bottom coordinate of the screen, but that's easy. In the end, our code has grown: We have added a second 'IF/THEN' -- and that has a real cost. But, I am actually betting on the 33,175 extra 'IF/THEN' statements actually having
less total cost, than the time savings we will achieve by avoiding doing the work of that extremely expensive 'Y' derivation in 1,176 of the iterations of our inner loop. In other words, the hope is the 'net' result will be more speed.
Am I right? Well, let's 'RUN' it and see! Here we go:
OK. We're now at 7 minutes, 46 seconds. It was worth it in my view. We've saved another 16 seconds off the total time with just this change.
Well, then. So far, optimizing is going even better than expected, and already we are seeing the program runtime clock in more than a minute faster than the goal we set at the beginning of the optimization process.
That said, this post is getting really long, so I am sorry to say I'll have to save the write up about BASIC variable storage and fetching -- and the changes we can make to optimize things further in view of that topic -- for the next time.
You'll have to scroll past a couple replies below, but it will be in this same thread. Sorry I couldn't get to it in this post, but its sort of interesting stuff and I want to take the time / space to do it justice. Stay tuned.