BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Snickers11001001 »


Well, yesterday was longer than I expected. Sheesh. I don't like medical stuff, but it's especially annoying when your appointment gets bumped several hours at a time because the docs are all in the ER working on a motor vehicle accident. Oh, and "no, you still cannot eat or drink anything but water, sir." Geez, just cancel my thing and let's reschedule, but don't keep me in there extra hours just waiting. Ah, well, I guess that's medicine in a medium-sized city in 'current year' USA. I'm very sorry I didn't get back until late, and then had errands and other things today, so I had to delay writing this up.

Before I start... Scott: Awesome work on the benchmarking of variable cost/overhead. One possible addition would be a line (before everything else) that initializes a bunch of dummy scalar variables (DA,DB,DC,DD,DE,DF,DG,DH, etc.). You can use 'DIM' to initialize regular/scalar variables -- just call the command with the variables separated by commas, e.g., 'DIM DA,DB,DC,DD,DE' -- which should let you fit 20+ dummy variables on one line. You can REM it out to duplicate the results you already have, then un-REM the line and see how scalar variable performance changes when the ones in your benchmarking routines effectively become later-initialized scalars and have to do the 'nope, nope' stuff in view of all the others earlier in the pecking order...
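Purely as an illustrative sketch (hypothetical line numbers, dummy names and loop body -- adapt to your actual benchmark):

1 REM UN-REM THE NEXT LINE TO PUSH THE TIMED VARIABLES DOWN THE PECKING ORDER
2 REM DIM DA,DB,DC,DD,DE,DF,DG,DH,DJ,DK,DL,DM,DN,DP,DQ,DR,DU,DV,DW,DX
10 TI$="000000":FORI=1TO2000:X=X+1:NEXT:PRINT"JIFFY CLOCK: "TI$

With line 2 REM'd out, 'I' and 'X' are the first entries in the variable table; un-REM it and every fetch of 'I' and 'X' has to scan past the twenty dummies first.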

Anyway, now to the thought process and the optimization that got the time of the 'Proteus' demo down to 3 minutes and 47 seconds.... 

As I said above, the key was looking at the "precursor" expression that creates the value that goes into the COS() functions in the cosines stack (with the result of all that, in turn, getting tweaked into the 'y' pixel coordinate).

EDITED: Yes, I know what I call the 'precursor' is the 'angle' that gets fed to the cosine functions. But it's not an angle, not really, because if the original author wanted to convert degrees to radians (s)he ought to have multiplied it by pi/180 (approx. 0.01745), and the coefficient used was 0.0327, which is not an even multiple of that or anything. It's just a magic number the original author fiddled with until the output matched what was wanted. So call it an angle if you wish, but it's just the 'precursor' to me! At any rate...

In the "0652" time version of the demo above, that precursor was at the beginning of line 4:

Y=SQR(J+I*I)*D

We obfuscated it a bit in prior optimizations by kicking part of the calculation up into the outer loop (J=O*O),  but in essence the expression generates the square-root of the sum of two squares and then multiplies the result by a magic number the original author came up with to get everything to work on the screen the way (s)he wanted.   

The things that are squared are the values of our two loop indexing variables (outer 'O' and inner 'I').    

My brainwave was to look at this and think about a basic mathematical identity:     N squared is equal to  (-N) squared.   A positive times a positive is a positive; a negative times a negative is also a positive.... 

This is helpful because each of the main loop indexing variables swings between negative and positive. The outer loop runs from -144 to 144 with a step of 2.25, so there are 129 different values of 'O' as it iterates. Digging deeper, that step increment is an even divisor of 144, which means that the indexing variable will have 64 negative values, the 0 value, and 64 positive values that are the SAME (save for the change in sign) as the 64 negative values. But that means the outer loop variable effectively makes only 65 possible different contributions to that square-root-of-the-sum-of-two-squares expression. Whether 'O' is 144 or -144, -141.75 or 141.75, the value of 'O' participates in the 'precursor' expression only by being multiplied by itself. The sign of 'O' is effectively rendered immaterial for THAT part of the calculations in this program.

Likewise, the inner loop runs from -144 to some varied positive number calculated based on the value of the outer loop indexing variable, with an increment step of 1.  The way the endpoint is calculated, it cannot be higher than +144.    Once again, that means that (keeping in mind the loop also swings through the 0 value while iterating), although the inner loop runs up to 288 iterations, the inner loop indexing variable can only supply 145 different contributions to that 'square root of the sum of two squares' expression...   The sign of the value of 'I' is immaterial to the precursor-to-cosines stack combo, since (-I) * (-I) is the same as I * I.    

Synthesizing all this, we start to see something: Our inner loop has been calculating the extremely expensive combination of what I've been calling the 'precursor' and that triple cosines stack (and a couple of multiplication operations into the bargain) to get the 'y' coordinate just under 32,000 times over the course of running the program -- i.e., whenever the more cheaply calculated 'x' coordinate lands within the valid coordinates of the bitmap screen. HOWEVER, our examination above has just demonstrated that the calculations in the "evaluate precursor, feed it to the cosine stack" sequence are always performed on what can effectively be treated as a maximum of 9425 (65 x 145) combinations of inputs.

It means we might be able to make a lookup table!    A 65x145 table.   

But whether we can implement such a table and what we put in it is limited by memory constraints.  

For example, if it turns out that we MUST store floating point values for this to work, we cannot use a regular array and this idea probably dies. A float array takes 5 bytes per entry. So, um, yikes! That's 9425 * 5 = 47,125 bytes, which is way WAY more than the 39K of memory available to BASIC. I suppose we could make an artificial fixed-point array using POKEs/PEEKs, storing two bytes in memory for each value: one for the integer portion and one for the fractional portion. But frankly, it seems to me that using BASIC to calculate, store and fetch both pieces might be too costly, since all our 'overhead' would have to be done in BASIC. (It gives me an idea for a 'USR()' routine, however, for some future project.)
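For the curious, the crude fixed-point idea would look something like this (an untested sketch; 'M' is a hypothetical free-memory base, 'K' the entry index, 'V' the value -- and note it only handles non-negative values, so sign handling would pile even more BASIC overhead on top, which is rather the point):

10 REM STORE: ONE BYTE FOR THE INTEGER PART, ONE FOR THE FRACTION
20 POKE M+K+K,INT(V):POKE M+K+K+1,INT(256*(V-INT(V)))
30 REM FETCH IT BACK (ACCURATE TO ABOUT 1/256):
40 V=PEEK(M+K+K)+PEEK(M+K+K+1)/256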

I didn't bother trying that, because it turns out we can store our 9425 values in an integer array. In C64 style BASIC, that means we have a range of -32768 to 32767 for each value, and it takes only 2 bytes of memory per entry.   That's less than 20K of memory cost here.   We can do that.   

But...  Does it actually help us to only be able to store integers?   To see about this,  let's reason it out.   Looking at the plotting sequence as a whole, it is apparent that the 'y' pixel coordinate is calculated and left (as one of my updates above mentions) as a float and fed to the graphics commands as-is, and the commands simply disregard the fractional portion.   

So can we throw out the fractional portion earlier?  

Let's see.   The final 'y' coordinate for a pixel location is evaluated as:

Y=G - ({n} - T + F)

{n} is the result of that 'precursor' and cosines stack.  'F' is a constant that acts like the "vertical position" knob on an old analog TV -- changing it moves the entire image up or down within the bitmap.   So far, I've kept it at the original author's suggested value of 90.   

'T' is the outer loop index value 'O' divided by the step increment 'B' (constant of 2.25), which is to say T ranges from -64 to +64. OH! LOOK! Now we see how, despite the sign of the outer loop indexing variable being discarded in the calculation of the precursor expression, the sign (pos/neg) of 'O' nevertheless DOES matter and plays into the ultimate 'y' coordinate pixel generation, because 'T=O/B' gives a different result in terms of sign depending on whether 'O' is positive or negative. Nice.

That also tells us that what I refer to as {n} above (the result of the precursor and cosines stack) is the thing we should store in our table, if possible.   

OK, so reasoning further: The result from the precursor/cosines stack (including the multiplication by the original author's constant 'C') is a float. It is a small float that we can be sure will not exceed BASIC's valid integer variable range of -32768 to 32767. We know this because we can see that after the tweaks (subtracting 'T', a value in the range -64 to 64, and adding 'F', a constant of 90), and then taking the result and subtracting it from 'G' (const. 199), it will, most of the time, yield a valid vertical pixel coordinate between 0 and 199.

That's great news. It means we can throw that {n} into our integer table. BUT. Doing so discards the fractional portion earlier than normal. Here's what that does and what we need to do to deal with it. Suppose that prior to using our table (i.e., in the earlier program versions) some {n} value, after getting tweaked by 'T' and 'F', winds up yielding a result of 19.378. That number then got subtracted from 199 to give a 'y' (vertical) pixel coordinate of 179.622. Feeding that to our graphics commands resulted in the fractional component being disregarded, meaning a pixel would be plotted at vertical position 179.

But, now suppose we stick the result of {n} into an integer array variable.   It loses its fractional component right at that moment.    Which means that same value after adjustment by 'T' and 'F' winds up returning  '19' instead of 19.378.   And subtracting 19 from 199 yields a vertical pixel coordinate of 180.    It's off-by-one.   That's the only problem.   We need to subtract an additional '1' from 199 to get the 'correct' pixel coordinate due to the 'premature' removal of the fractional component when we use the table. 

Remember that 'F' constant I just mentioned (the one that's like a 'vertical position' setting)?   We will change it from 90 to 91, and that will do the trick. 
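Put in terms of the plotting line (and you'll see exactly this in the listing below), the change amounts to:

OLD:  Y=G-({n}-T+F)           with F=90, where {n} still carries its fraction
NEW:  Y=G-(Q%(J,ABS(I))-T+F)  with F=91, where the table entry is already truncated

Here Q% is the integer lookup table, 'J' holds ABS(T) for the outer index, and ABS(I) folds away the inner index's sign -- which, as we established, never mattered to {n} anyway.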

Well, that's it.           

That's how it works.  I'll put up a screen shot of the listing tomorrow.   Also,  I'm going to put up a combined 'final' program in the 'demos' section, probably with a little menu that lets you pick between three versions of this sucker.  I'll try to include one of the very early versions from this thread (if I can find one I saved), the 1.0 version that was the best performer prior to this breakthrough, and then this version '2.0' update.   Having all three in one program will hopefully help folks make a close study of the listing and see how things evolved. 

[EDITED:   Ok, the 3-in-1 demo is in 'demos' in the downloads section]

One final word:     This latest optimization really improved performance, but only by eating literally HALF of all the memory available to BASIC.   That's often a thing... sometimes you can gain a lot of speed by throwing memory at a problem.   And for a free-standing demo like this, hey... whatever works.   Many famous C64 demo scene machine language programs pre-calculate all sorts of stuff using every possible available byte of memory.    But if this were just a small part of another program, it would clearly be a bridge too far.     That said, given what the values in the table look like, I have an idea for an alternative implementation that might be able to get away using only 10K of memory...  but there's no way it would be this fast.        

Cheers!  

 

ETA:     Here's that listing....     

[screenshot of the v2.0 program listing]

desertfish
Posts: 1072
Joined: Tue Aug 25, 2020 8:27 pm
Location: Netherlands

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by desertfish »


Can you paste it in ASCII here as well so I can run it easily myself? Thanks!

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Snickers11001001 »



9 minutes ago, desertfish said:

Can you paste it in ASCII here as well so I can run it easily myself? Thanks!
Here's the newest version 2.0 with the precalc.   Let me know if you want a copy/paste of what I stuck in the downloads -- i.e., the three-in-one combo with the slow, fast, and fastest versions.  Cheers.

1 X=.: Y=.:I=.:G=199:T=.:U=.:F=91:J=.:D=.0327:C=20:O=.:A%=.:A=144:B=2.25
2 E=160:K=A*A:L=.5:DIMQ%(64,A):SCREEN$80:TI$="000000":GOSUB8
3 FORO=-ATOASTEPB:T=O/B:U=T+E:A%=L+SQR(K-O*O):J=ABS(T):FORI=-ATOA%
4 X=I+U:IFX>=.THENY=G-(Q%(J,ABS(I))-T+F):IFY<GTHENLINEX,Y,X,G,1:PSETX,Y,.
5 NEXT:NEXT
6 PRINT"\X97\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11\X11 TIME:"TI$"\X13\X11\X11"TAB(23)"THE PROTEUS DEMO"
7 FORI=-1TO0:GETK$:I=(K$=""):NEXT:SCREEN 0:COLOR1,6:CLS:LIST:END
8 PRINT"\X11\X1D\X1D\X1D\X97*PRE-COMPUTING FAST LOOKUP TABLE*":J=D:U=C:FORT=.TO64:X=T*B:Y=X*X
9 FORI=.TOA:X=SQR(Y+I*I)*J:Q%(T,I)=U*(COS(X)+COS(X+X)+COS(5*X)):NEXT:X=4*T+27
10 RECTX,C,X+2,27,11:NEXT:SCREEN $80:      RETURN


Scott Robison
Posts: 952
Joined: Fri Mar 19, 2021 9:06 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Scott Robison »



8 hours ago, Snickers11001001 said:

Before I start... Scott: Awesome work on the benchmarking of variable cost/overhead. One possible addition would be a line (before everything else) that initializes a bunch of dummy scalar variables (DA,DB,DC,DD,DE,DF,DG,DH, etc.). You can use 'DIM' to initialize regular/scalar variables -- just call the command with the variables separated by commas, e.g., 'DIM DA,DB,DC,DD,DE' -- which should let you fit 20+ dummy variables on one line. You can REM it out to duplicate the results you already have, then un-REM the line and see how scalar variable performance changes when the ones in your benchmarking routines effectively become later-initialized scalars and have to do the 'nope, nope' stuff in view of all the others earlier in the pecking order...
Thanks. The numbers from my benchmarking are approximate at best and could stand a lot more teasing out of details. The reality is that things like variable lookups are going to vary a lot depending on how many variables exist (as you allude to), since they have to be scanned sequentially (earlier definitions are faster than later definitions).

I think the hard part of using DIM to measure definition overhead is that it can't easily be looped a number of times. With FOR NEXT loops I can embed something in the middle to time it, then factor out the overhead of the FOR NEXT itself to come up with estimates. But once a variable is defined with DIM it can't be redefined unless CLR is used, and at the point CLR is used I think it would destroy the context of the FOR NEXT loop. {time passes} Yes, that is exactly what happens.
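For anyone who wants to see it, a minimal demonstration:

10 FOR I=1 TO 3
20 CLR
30 NEXT

RUN that and NEXT dies with ?NEXT WITHOUT FOR ERROR IN 30, because CLR wipes the FOR/NEXT bookkeeping off the runtime stack along with the variables.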

Now, I don't have to use a FOR NEXT loop; I could use PEEK & POKE & GOTO. But I think (and I don't have time to test it right now) that the FOR NEXT bookkeeping is able to quickly locate the beginning of the loop again, as evidenced by the fact that I can make the FOR statement the second part of a line:

10 TI$="000000":FOR I=1TO1000
20 REM DO SOMETHING
30 NEXT:PRINTTI

To try to replace that with a GOTO means that any reverse GOTO has to begin searching from the first line of the program trying to find the intended destination, which means GOTO a higher line number takes more time than a lower line number. I fear the measurement of time (which is already less than perfect) becomes even less reliable at that point as too much overhead would be a part of the measurement.

Of course, this could all be addressed by writing simple little individual programs to measure each portion rather than one monolithic program that does it all. It would probably be better and more reliable. I might do that at some point if someone else doesn't do it before I can get around to it.

Edit: I didn't read carefully enough the first time. You were explicitly addressing the very topic I thought you alluded to. Sorry.

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Snickers11001001 »


So....   

I was browsing trigonometry tricks sites. This was actually research for another project. But in the course of reading through all that stuff, I learned there's a thing called the cosine double-angle identity. And its possible application here was immediately apparent.

The identity provides that for any angle 'n', the value of cosine(2*n) is equivalent to the value of (2*cosine(n)^2)-1.

That's sort of interesting because the cosine stack in my demo has both COS(n) and COS(n+n).    

To implement an optimization based on this, the result of the expression I've called the precursor has to be prepared (call it 'n'). Then you get COS(n) and store it, let's say to 'X' [i.e., X=COS(n)]. The cosine stack then becomes (X+2*X*X-1+COS(5*n)). Now there are only 2 COS() operations per cosine stack evaluation, instead of three.
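Concretely, the precompute line 9 of the v2.0 listing above went from

9 FORI=.TOA:X=SQR(Y+I*I)*J:Q%(T,I)=U*(COS(X)+COS(X+X)+COS(5*X)):NEXT:X=4*T+27

to something along these lines ('Z' is just an illustrative name for the new scratch variable):

9 FORI=.TOA:X=SQR(Y+I*I)*J:Z=COS(X):Q%(T,I)=U*(Z+2*Z*Z-1+COS(5*X)):NEXT:X=4*T+27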

But the question is:   Is a single cosine operation in that stack so slow that it would potentially save time to replace it with a more convoluted expression that requires an extra variable store, two extra multiplications and several extra variable fetches?   Could all that additional work nevertheless be faster?  

Apparently, yes. 

Implementing the double-angle identity to get rid of one of the COS() calls actually did cut the 'precomputation' part of the revised Proteus Demo down from 1 minute and 39 seconds to 1 minute and 25 seconds, which brought overall time down to 3:32 from 3:47.     Despite increasing the number of 'BASIC' operations, the total cycles for all those added operations still wound up being less than the cycles incurred from a single COS() operation.   

So, sort of interesting, eh?

Scott Robison
Posts: 952
Joined: Fri Mar 19, 2021 9:06 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Scott Robison »


It is interesting, and not surprising at all! Transcendental functions take a lot more computational power than any algebraic expression.

A few years ago I tried submitting an obfuscated C program to ioccc.org. I didn't try to do any transcendentals with it, but I did try to replace all math with bit operations. It would be "interesting" to create a complete library of functions that performed all math & logic (bit manipulation, adds, subtracts, multiplies, divides, exponentiation, transcendentals) in terms of simple universal gates like nand or nor.

Not terribly useful from a software development perspective, but interesting.

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Snickers11001001 »



1 hour ago, Scott Robison said:

It would be "interesting" to create a complete library of functions that performed all math & logic (bit manipulation, adds, subtracts, multiplies, divides, exponentiation, transcendentals) in terms of simple universal gates like nand or nor.
That would be pretty rough. I make enough typos that I cringe to think of debugging that.

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Snickers11001001 »


OPTIMIZING A SIMPLE PROGRAM PART TWO: 'FRACTAL BOOGALOO....'

This is a continuation of a discussion started in the Software Library chat by SlithyMatt when he uploaded his neat BASIC routine that outputs an extreme zoom-in on a portion of the Mandelbrot fractal, rendered in 256 colors at 320x240.

That thread is here:

https://www.commanderx16.com/forum/index.php?/topic/1773-new-demo-uploaded-fancy-mandelbrot-set-zoomed-plot/

It's a very computationally intensive program, and of course, Matt's code is written to be straightforward and extremely easy to follow -- he's a great tutor/mentor, which is why his YouTube videos are so popular. But that means he wasn't necessarily aiming for speed hacks. That being the case, other members spit-balled about doing an optimized version in BASIC. I indicated I would be inclined to give it a try, but then real life interrupted me and I'm only just getting around to it.

Also (after I started working on this write-up yesterday) I noticed I'm really REALLY late to the game, as Scott has done a full thread documenting his work squeezing the best performance out of this neat BASIC routine. His thread can be found here:

https://www.commanderx16.com/forum/index.php?/topic/1780-optimizing-basic-fancy-mandelbrot-set-zoomed-plot-calculations/

I highly recommend reading both of those threads before starting with this write-up.  And in any event, I haven't done anything yet so it will give you something to do while you wait for my next post.

As I noted in Matt's thread, I started by adding some code to Matt's original to capture the elapsed time of the original code. Since I'm too lazy to research the correct pokes to restore the VERA to a proper text display after the 320x240 plotting, I used a trick that self-modifies the BASIC listing to tuck the elapsed time into a REM statement. Then you reset the X16, use the 'OLD' command to restore the BASIC listing, and you'll find the elapsed time in there. Anyway, it turns out a full plot under the original takes 15 hours, 8 minutes and 49 seconds using ROM version R38 on the official emulator.
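You can see the trick in lines 1 and 6 of the listing I post further down: line 1 parks a REM with a run of placeholder dots at the very top of the program, and line 6 pokes the six digits of TI$ right over those dots in program memory (the $819+I addresses land inside that REM text):

1 TI$="000000":GOSUB7::REM [-......-]
6 A$=TI$+"":FORI=1TO6:POKE$819+I,ASC(MID$(A$,I,1)):NEXT:END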

In the posts that follow we're going to do my same optimization process as earlier in this thread....   

First, we'll go through the code to really make sure we GROK (that's old-dude, Heinlein-inspired slang for 'understand intuitively or on a deep level') exactly what the original code is doing. What are the loops? What are the branches and why are they there? What's the set-up like? What's the output stage like? What things will BASIC do slower than it could? As part of that post, I plan to show you some wonkiness with Commodore-style (well, really all) floating point precision/rounding implementations that I expect could play into things when the numbers are really, really small, as with a fractal program.

Second, I'll show you what I ultimately come up with in terms of an optimized version and go through my changes and why/how they work. The only guarantee I'll make is that the changed program will be much more difficult to follow and understand when compared to the original.

Finally, I'll do a full run on the emulator and see what my version accomplishes when all is said and done.   I won't lie, I'm a little doubtful that the empirical results will be anywhere near as good as with the little 'Proteus' demo earlier in this thread.   But we'll do our best and recognize that BASIC can only do so much with something that, at its core, grinds this much floating point math. 

Please do ask questions as we go, which will make the process more interesting and informative for everyone. 

Stay tuned.

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Snickers11001001 »


Fractal Boogaloo writeup:     

A.  Cogitating and understanding.                    

NOTE: I was going to stick in an aside about floating point precision here, but I think it will fit better later in the write-up, when I hopefully get to the end of the optimization process and show some screenshots.

For now, let's start instead by considering this description from Wikipedia on programming the Mandelbrot...


Quote

In the escape time algorithm, a repeating calculation is performed for each x, y point in the plot area and based on the behavior of that calculation, a color is chosen for that pixel.

The x and y locations of each point are used as starting values in a repeating, or iterating calculation .... The result of each iteration is used as the starting values for the next. The values are checked during each iteration to see whether they have reached a critical "escape" condition, or "bailout". If that condition is reached, the calculation is stopped, the pixel is drawn, and the next x, y point is examined.

The color of each point represents how quickly the values reached the escape point. Often black is used to show values that fail to escape before the iteration limit, and gradually brighter colors are used for points that escape. This gives a visual representation of how many cycles were required before reaching the escape condition.
And here is the listing of Matt's program...  notice that he's implemented the 'escape' mechanism just as described in the article:

10 SCREEN 0
30 POKE $9F2D,$07
40 POKE $9F2F,$20
50 POKE $9F29,$11
100 FOR PY=0 TO 239
110 FOR PX=0 TO 319
120 XZ = PX*0.000036/320-0.747345
130 YZ = PY*0.000027/240+0.08784
140 X = 0
150 Y = 0
160 FOR I=0 TO 355
170 IF X*X+Y*Y > 4 THEN GOTO 215
180 XT = X*X - Y*Y + XZ
190 Y = 2*X*Y + YZ
200 X = XT
210 NEXT I
215 I = I - 100
216 IF I=256 THEN I=0
217 B = 0
218 OS = $4000
219 Y = PY
220 IF Y < 153 THEN GOTO 230
221 IF Y = 153 AND PX < 192 THEN GOTO 230
222 B = 1
223 OS = -192
224 Y = PY-153
230 VPOKE B,OS+Y*320+PX,I
240 NEXT PX
260 NEXT PY

The basic structure of the program is: [initialization] [rows_loop (pixels_loop {colors_loop} -output-)]

Initialization. Lines 10-50. It pokes the VERA registers, setting up the screen. No further comment/analysis necessary.

Outer Loop ('rows_loop'): Encompasses everything from line 100 to the end. Indexed by 'PY' in the original code, this is basically the plotting of all the rows of pixels. A single iteration of the rows_loop performs everything necessary to drop a complete row of pixels onto the screen; the complete progression of all its iterations plots every row of the plot. Two loops are nestled within this main loop.

Nestled Middle Loop ('pixels_loop'): Indexed by 'PX' in the original code, and comprising lines 110 to 240, this goes through each pixel location within a row (moving left to right), figures out its color, and includes the output stage that actually puts a pixel on the screen. A single iteration of the pixels_loop results in one pixel getting plotted within the then-current horizontal row; a complete run of all its iterations puts up all the pixels in a single row. The colors_loop is nestled within.

Nestled Inner-most Loop ('colors_loop'): Indexed by 'I' in the original code, lines 160 to 210. This is the code that grinds out the color selection for a pixel using the fractal math and that 'escape' mechanism. Each iteration updates x and y and checks to see if x^2 + y^2 exceeds a threshold of 4. If it exceeds this escape threshold, program execution jumps out of the loop, with the iteration count at that point (i.e., the then-current value of the 'I' indexing variable) available for the output stage of the pixels_loop to use to derive the color of the pixel. If the colors_loop gets through all 356 iterations (0 to 355, both inclusive) without triggering the escape condition, execution falls into the pixels_loop, which in turn forces a black pixel.

Output: The pixels_loop has the output code. At that point, the nestled colors_loop has either completed a full run or escaped early. The output stage tweaks the value of the colors indexing variable 'I' so it's within the valid range of the 0-to-255 VERA color palette, with an if/then branch at line 216 above to force 'I' to 0 (black) if the colors_loop completed its entire run without an escape. (Remember, the only time 'I' would get there with a value of 256 is if the colors_loop exited after a full run: the 'NEXT' in the final iteration adds 1 to 355, execution falls through without another iteration when the 'NEXT' command determines the result, 356, is greater than the specified endpoint of the loop, and then line 215 subtracts 100, leaving 256.) Based on the current pixel coordinates, the output stage uses a series of branches in lines 220 and 221 to execute or jump over adjustments to the VPOKE 'bank' parameter and an offset value. It then executes the VPOKE with some included maths to get the pixel of the selected color plotted at the right screen coordinate.

At 320x240 resolution, the program puts up 76,800 pixels. For each pixel, the inner-most colors_loop iterates up to 356 times (0 to 355, both inclusive). Matt tells us the code takes at least 100 iterations per pixel. The program code above ALSO tells us as much, since line 215 subtracts 100 from the value of 'I' before eventually VPOKEing it, and the X16 would error out with "ILLEGAL QUANTITY" if the number being poked were negative. I don't know whether the fractal 'escape' behavior follows any tidy distribution, but let's assume an average of about 178 iterations per pixel, which is to say the code in the colors_loop gets parsed and executed by the BASIC interpreter in the neighborhood of 13.67 million times to produce the full image.....

By the way, it takes the X16 over 38 minutes to run a completely empty  FOR/NEXT loop that many times. (!!!) 
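(If you want to check that figure yourself, a one-liner along these lines will do it -- be prepared to wait:)

10 TI$="000000":FORI=1TO13670000:NEXT:PRINT"EMPTY LOOP TOOK "TI$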

But what about the 'magic' numbers in that code?   What's with those calculations "XZ = PX*0.000036/320-0.747345" and "YZ=PY*0.000027/240+0.08784" ??

Here's what's what: The "-.747345" and ".08784" are the "starting points" selected by Matt in the fractal x,y domain (which I think runs from -2 to 1 on the x/real axis, and from -1.5 to 1.5 or so on the y/imaginary axis). The .000036 and .000027 are the sizes of the ranges being examined on each axis. The 320 and 240 are the number of 'points' he wants to sample within the respective ranges (i.e., a number of points corresponding to the screen resolution). So in the 'x' range, he wants to start at the number -.747345 and go from that number to an end point that is only .000036 larger, grabbing 320 evenly spaced numbers in that range. Same for the 'y' range, except starting at .08784 and going to an end point .000027 away, grabbing 240 numbers in that range. You might notice that 0.000036/320 and 0.000027/240 BOTH evaluate to 0.0000001125 (or 1.125E-07, as it will display on your X16). That's how zoomed in this is, i.e., it takes that itsy-bitsy increment added to the x,y to create a new starting point before running the colors_loop to see if/when the escape condition is met. If the 'full' Mandelbrot were mapped to the size of a football field, the width of the piece Matt's program is mapping onto the screen would be a teeny tiny spot on that football field (i.e., fractions of a millimetre!). Pretty impressive that a BASIC derived from 1980s Commodore/Microsoft code has enough floating point precision to handle things at this level of zoom.

The thing that's striking here is that the math is not actually all that complex. There aren't any transcendental functions, no trigonometric specialty commands (although the escape threshold is, as a practical matter, checking to see whether a particular x,y point in the fractal domain wanders out of a circle of radius 2 during accumulation of the changes caused by the math of those inner-loop iterations). There aren't even any square roots, and the only exponentiation is a simple power of two. But it's doing that short set of simple operations over, and over, and over and over, and OVER.

Also look at the calculations that adjust 'x' and 'y' in each iteration of that colors_loop. The result for both x and y at the end of each iteration is affected by BOTH the values of x and y at the beginning of the iteration. There are no partial results (except in the very first iteration) that can be pre-calculated and reused. While it is true that the Mandelbrot set is mirrored across the real (x) axis, that only leads to an optimization when the starting point and range straddle the axis of symmetry, which is not the case here. That makes speeding up the most cycle-intensive part of the program hard, especially in BASIC.

As Matt put it in his thread, "The core driver of the time is the math that just needs to happen."

B.  OPTIMIZATIONS ALREADY NOTED/PERFORMED BY OTHERS.  

1.  Evicting calculations. 

Scott's thread covers some of these. For example, one of his early moves was to kick the divisions out of the loops and into the initialization (.000036/320 and .000027/240 ALWAYS yield the same answers and do not need to be evaluated 76,800 times in the pixels_loop). One of his updates kicked the entire derivation of YZ out of the pixels_loop and put it in the rows_loop, since that expression only changes as the row number being worked on changes. And by the end of his optimization process, he reached the point where he simply added the result of the divisions (the .000036/320 and .000027/240) directly to the 'starting points' as adjusted by all prior increments.
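A sketch of where that ends up (illustrative line numbers, not Scott's literal code -- see his thread for the real thing): both divisions come to the same 1.125E-07, so a single constant 'S' serves for both axes, and the end state just accumulates it:

90 S=.000036/320:YZ=.08784:REM DIVISION DONE ONCE, EVER
100 FOR PY=0 TO 239
110 XZ=-.747345:FOR PX=0 TO 319
120 REM ...COLORS_LOOP AND OUTPUT GO HERE...
240 XZ=XZ+S:NEXT PX
260 YZ=YZ+S:NEXT PY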

2.  Duplicated operations in the inner-most loop. 

Matt mentioned, and Scott implemented, a change based on the fact that both lines 170 and 180 of the original code calculated the squares of 'x' and 'y'. In the original code, line 170 has 4 variable fetches and two multiplications to do this before getting to the evaluation of the IF/THEN condition. Then line 180 again did the same 4 variable fetches and two multiplications as part of the rest of the stuff getting calculated there. By introducing two other variables at the outset of the loop and setting them to x*x and y*y respectively (two variable stores, four variable fetches and two multiplies), the colors_loop can then simply fetch the results for use in both the IF/THEN and the expression in the following line. Net result: 8 variable fetches and 4 multiplies are replaced by 2 variable stores, 6 variable fetches, and only 2 multiplications for that same part of the calculations. As long as the variables are initialized by frequency of use (see my explanation on the prior page of this thread), the savings from getting rid of those two multiplications will be a net benefit. An added benefit of this change noted by Scott is the ability to avoid having to store an intermediate value (held in the 'XT' variable in the original code): we eliminate a store of 'XT' when it is calculated and a fetch of 'XT' when its value is put into 'x', as well as the associated character parses by the BASIC interpreter.
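In before/after terms (with 'Q' and 'R' as illustrative names for the square-holders -- my final listing below uses the same idea):

BEFORE:
170 IF X*X+Y*Y > 4 THEN GOTO 215
180 XT = X*X - Y*Y + XZ
190 Y = 2*X*Y + YZ
200 X = XT

AFTER:
170 Q=X*X:R=Y*Y:IF Q+R>4 THEN GOTO 215
180 Y=2*X*Y+YZ:X=Q-R+XZ

The ordering matters: 'Y' must be updated while 'X' still holds its old value, and the saved squares then let 'X' be rebuilt without 'XT' at all.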

3.  Using VERA data port instead of VPOKES....

Matt's original thread linked above noted one of the most interesting optimizations from the X16 perspective: changing from VPOKEs to use of the VERA data port. The original code uses VPOKEs, which means that, based on the current row/column pixel coordinates, you need to ascertain the right VERA bank number and then the offset into the memory space of that bank. Since the output stage is within the pixels_loop, all the work to do that stuff gets performed 76,800 times over the course of a run of the program. But VERA memory can also be filled using one of the VERA data ports. You do that by poking the high/low components of the address you want to start writing to into the VERA registers corresponding to one of the two data ports, and the auto-increment step into another. Then, every time you write a value to the VERA data port, VERA itself moves its internal pointer so the next write to the port puts the value at the next memory address according to the increment selected. No more worrying about VPOKE bank values or offsets from a memory starting point. So, for the first 152 rows of the screen, you save [i] 76,800 executions of the stuff in original code lines 217 to 220 (3 variable stores, 2 variable fetches, and an IF/THEN with a less-than evaluation); [ii] in row 153 you save all of the stuff just mentioned PLUS another line 221, with its two variable fetches and an IF/THEN with two comparison evaluations; [iii] in the last 128 pixels of row 153 and in rows 154 to 239 you save everything from [i] and [ii] as well as another three variable stores, another variable fetch, and a subtraction from original code lines 222 to 224; and [iv] all the bytes of BASIC parsing from the foregoing. AND, if you use a variable to hold the VERA data port address, the POKE you substitute for line 230 will have only two variable fetches, whereas the original VPOKE has three more variable fetches, two additions, and a multiplication, all run all 76,800 times. Also, as I mentioned in Matt's thread, POKE is itself slightly faster in BASIC than VPOKE, not least of all because POKE parses only two arguments, while VPOKE parses three of them.
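Stripped to its essence, the data port set-up is just a few pokes (register layout per the R38 VERA; this mirrors what the DATA statement in my final listing below feeds in):

10 POKE $9F20,0:POKE $9F21,$40:POKE $9F22,$10:REM VERA ADDRESS $4000, AUTO-INCREMENT 1
20 O=$9F23:REM DATA PORT 0
30 POKE O,I:REM WRITES THE PIXEL AND VERA BUMPS ITS OWN POINTER

After line 10, each POKE O,I drops one pixel and advances VERA to the next byte of the bitmap, so all the bank/offset arithmetic simply disappears from the loop.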

4.  Variable Initialization.   

Scott covered this, and you'll remember the post I put in the prior page of this thread about the "nope, nope, nope.." process of how the BASIC on the X16 (and C64, and C= +4 and C128) use regular scalar variables.   The idea is to figure out the most used variables in the inner most loops and initialize them in priority of frequency of use, so that those that are used most often are also fastest for BASIC to fetch and store. 
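In my final listing below, that's handled with a single DIM up front, hottest variables first (line 7 also goes on to assign the constants):

7 DIMX,Y,Q,R,V,H,L,I,O,F,N,S,T,U,M

'X', 'Y', 'Q' and 'R' do the heavy lifting inside the colors_loop, so they get the shortest 'nope, nope' scan.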

5.   Pleasing the parser gods.    

Finally, Scott's thread dealt with removing spaces, crunching lines, single-character variables, using '.' whenever one wants to parse a zero (0) value and the like.   Every little bit helps, especially with the stuff in that inner-most loop running millions of times. 
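One tiny example of what this looks like in practice -- Matt's line 140 versus its crunched equivalent:

140 X = 0
140 X=.

Same meaning, but there are no spaces for the interpreter to skip, and '.' is parsed to a zero value faster than the literal digit '0'.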

 

signing off for now... 

Ok, that gets us to the end of the ground already covered by Matt and Scott.   

I believe I have spotted a couple of things that should (hopefully?!) improve speeds even further. As I think I mentioned, I'm doing my testing by grinding out a selection of rows (30 rows per run, to be precise) from within the image, and actually including output and plotting in my tests. I'm doing this because I need a variable to use for the VERA data port address and I want to account for its impact and initialization priority in the benching. Also, it turns out the use of the VERA data port, and the resulting simple two-argument POKE, will allow me to implement a nifty optimization. Based on testing so far, I think I've got some genuine improvements here, but I want to do a few more tweaks in the next few days to have something really concrete.

Also, I think I may have stumbled onto something a little bizarre, but I want to dig into it deeper, including by doing some tests on a C64 and a Plus/4 on a few different emulators, to see if what I think is happening is actually a thing with Commodore BASIC that made its way to the X16. Sorry to be cryptic, but it is sort of weird and right now something I have yet to fully understand.

Cheers.

Snickers11001001
Posts: 140
Joined: Wed Jan 20, 2021 6:43 pm

BASIC: Converting/Optimizing a simple BASIC program from another Commodore platform to the X16...

Post by Snickers11001001 »


FRACTAL BOOGALOO, continued...

Sorry for the delay, but I suddenly have a lot more personal things going on unexpectedly.   I will write this up with some detail later, but it will most probably be several weeks at the very least.  Since I have a few minutes now, I wanted to at least put up the listing for what I was able to do to further optimize this one:

1 TI$="000000":GOSUB7::REM [-......-]
2 SCREEN.:Q=$9F20:FORI=.TO5:READX,Y:POKEQ+X,Y:NEXT:FORV=MTOM+S*239STEPS
3 FORH=TTOUSTEPS:X=.:Y=.:FORI=.TON:Q=X*Y:X=H+X*X-Y*Y:Y=V+Q+Q:NEXT
4 FORI=.TOF:Q=X*X:R=Y*Y:IFQ+R<LTHENY=V+2*X*Y:X=H+Q-R:NEXT:POKEO,.:NEXTH,V:GOTO6
5 POKEO,I:NEXTH:NEXT
6 A$=TI$+"":FORI=1TO6:POKE$819+I,ASC(MID$(A$,I,1)):NEXT:END
7 DIMX,Y,Q,R,V,H,L,I,O,F,N,S,T,U,M:L=4:S=1.125E-7:F=255:N=99:T=-.747345:O=$9F23
8 U=T+S*319:M=.08784:RETURN
9 DATA13,7,15,32,9,17,.,.,1,64,2,16


The original code from Matt's thread takes 15 hours, 8 minutes, 49 seconds  (i.e., 54,529 seconds) to put up the full fractal.     

After my changes, the program completes the full plot in 9 hours, 44 minutes, and 15 seconds (i.e., 35,055 seconds).

All testing was done using the official release emulator and the R38 ROM.

Anyway, my updates still result in a very long plotting time, but they accomplish an approximately 35.7% time savings compared to the original. That's nothing to sneeze at.

I will get around to writing this up, but in the interim ask any questions and I will get to those with responses when I can.  
