Page 1 of 3

Fast Math in assembly language

Posted: Fri May 13, 2022 6:33 pm
by Ed Minchau

When programming in assembly language, quite often you will need to do calculations that would normally require floating point math.  Those subroutines are available, the same ones that BASIC uses, but they are very slow.  And if you're programming in assembly, you're doing so because it's much faster than BASIC, so using those routines can defeat the purpose, particularly if you need to do a lot of such calculations.

Ideally in assembly language you want to have most of your variables consisting of single bytes, not the four used for floating point.  It's having to move all those bytes around and doing calculations involving all of them that makes the floating point subroutines slow.  And if you're going for speed, you can make a tradeoff: increased speed at the expense of accuracy.  A function involving a single byte can just be a lookup table.

Over the last several years I have developed a number of lookup tables and subroutines that enable some very fast math on the 6502 and now 65c02.  Inspired by @svenvandevelde's question about ATAN2, I decided to compile many of them into a single 8kb file that can be loaded into banked RAM.  I have other such routines and lookup tables, but I decided to just include those that would most likely be useful to the broadest range of applications.  These lookup tables and functions will work on all revisions of the X16 emulator, and indeed will work on any 65c02 system as long as the file is loaded at $A000.  The functions are all called through JMP redirects in page BF, and those JMP table locations will not change if I do revisions on the code.  I'm pretty sure I killed all the bugs, but some might have slipped through.

I'll be using further posts in this thread as sort of a user guide, but attached to this post is the file FASTMATH.BIN (the lookup tables and functions) and a text file containing the BASIC program that made all of the lookup tables (but don't copy and paste it into the emulator or it will overwrite FASTMATH.BIN).  When using FASTMATH.BIN, you must use a relocated load to put it into banked RAM, otherwise it will load into memory at $7000 and won't work.  Or you could load it into its default location at $7000 and just copy it up into the RAM bank of your choice.

This software is released under an MIT license, meaning you can use it freely for any purpose, commercial or non-commercial, as long as you leave the MIT license in there (it's located at $BF7F).  And please, these functions are all just approximations accurate to about 1%, so don't use the software for controlling a nuclear power plant or medical equipment.
maketables.txt
FASTMATH.BIN

Fast Math in assembly language

Posted: Fri May 13, 2022 6:52 pm
by Ed Minchau

A single byte can represent either an integer in the range 0-255 or a signed integer in the range -128 to 127; the difference is in how we think of what bit 7 represents.  If bit 7 is equivalent to 128, it's an unsigned integer, and if it's equivalent to -128 it's a signed integer.   The signed integer representation is sometimes called Two's Complement notation.  

In two's complement notation, a number can be negated simply by only two lines of code:

EOR #$FF

INC A

So -127 is just 7F EOR FF plus 1 = $81, and -1 is $FF.

FASTMATH uses this two's complement notation, with a couple of twists.  First, $80 is considered an invalid number.  Only the range from -127 to +127 is used.  These numbers are considered the numerator of a fraction, with 127 as the denominator, so -127..127 represents the range [-1.0 .. +1.0].  It isn't quite a fixed point notation; it's a fractional notation.  Only the numbers in the range [-1.0 .. 1.0] have any meaning to FASTMATH functions.

The exception to this is trigonometric functions; an angle is not treated the same way as other numbers.  For angles, we divide the circle up into 256 "binary degrees" or "bigrees".  So for angles, $80 bigrees = pi radians = 180 degrees. So to add 180 degrees to an angle, just EOR #$80.

There are 26 lookup tables in FASTMATH that are potentially useful.  For each of these, you can put the input in the X (or Y) register and just read the output:  LDA HALF,X will return half the value of X in the accumulator, for instance.

All of these lookup tables are listed as DATA statements in the text file posted above, with REM statements describing each one, so if you just want to use a few of these tables in your own code you can just extract what you need.

Be sure to set the RAM bank pointer to whatever bank holds FASTMATH when using these tables.

name  address  description

MBY2    A000    multiply by 2, truncated at 7F and 81

HALF    A100    multiply by 64/127

MLT5    A200    multiply by 32/127

MLT4    A300    multiply by 16/127

MLT3    A400    multiply by 8/127

MLT2    A500    multiply by 4/127

MLT1    A600    multiply by 2/127

MLT0    A700    multiply by 1/127

FABS    A800    absolute value

DIV6    A900    127*64/index; 00 if absolute value of index is less than 64; $80 if index is zero

DIV5    AA00    127*32/index; 00 if absolute value of index is less than 32; $80 if index is zero

DIV4    AB00    127*16/index; 00 if absolute value of index is less than 16; $80 if index is zero

DIV3    AC00    127*8/index; 00 if absolute value of index is less than 8; $80 if index is zero

DIV2    AD00    127*4/index; 00 if absolute value of index is less than 4; $80 if index is zero

DIV1    AE00    127*2/index; 00 if absolute value of index is 1; $80 if index is zero

DIV0    AF00    127/index; $80 if index is zero

FSIN    B000    sine of index

FCOS    B100    cosine of index

SQRT    B200    square root of absolute value of index

SQAR    B300    square of index

SQA3    B400    2/3 of square of index

ML3F    B500    multiply by 63/127

M2/3    B600    multiply by 2/3

M1/3    B700    multiply by 1/3

FACO    B800    arccosine

FLAT    B900    arctangent of slopes from -1 to 1

The final two lookup tables aren't directly useful to the user; they're just included here for completeness.

NOR2    BA00    used for 2d normalization

NOR3    BA80    used for 3d normalization


Fast Math in assembly language

Posted: Fri May 13, 2022 7:05 pm
by Ed Minchau

There are 36 functions included in FASTMATH.  I'll just list them here for now, and will go into more detail (and show the code) in subsequent posts.

These functions are all given a 4-character label, but you can label them whatever you want in your own code.  The JMP redirect adds three precious cycles to each subroutine, but it's worth it to keep your code working even if I revise this file later.  The time in cycles (including the JMP redirect and RTS, but not your JSR call) for each subroutine is also listed ; variable time subroutines are listed with a mean time in brackets.  Some of those are so short that the call and redirect is a significant fraction of the total time, so for those you may want to simply copy them inline in your code;  the code for the really short ones will be listed in subsequent posts when I go into more detail.

It is assumed that if you're using these functions you're not also using BASIC, so these functions use zero page locations F0 to FF as work space.

Be sure to set the RAM bank pointer to whatever bank holds FASTMATH when using these functions.
    name    addr    time    description
ARITHMETIC (SCALAR) FUNCTIONS                
    ADDT    BF99    (25)    add values in X and Y, truncate on overflow
    SUBT    BF9C    (28)    subtract X – Y, truncate on overflow
    FAVG    BF9F    19     X(64/127) + Y(63/127) {approximately X / 2 + Y / 2}
    FHDF    BFA2    19     X(64/127) – Y(63/127) {approximately X / 2 – Y / 2}
    FCMP    BFA5    23     compare X to Y
    FMIN    BFA8    (39)    minimum of X and Y
    FMAX    BFAB    (39)    maximum of X and Y
    FMLT    BFAE    47     multiply X by Y, faster version
    MULT    BFB1    (87)    multiply X by Y, more accurate version
    FDIV    BFB4    (108)   divide X by Y
    FDV2    BFB7    (88)    divide X by Y, no error checking
                
SINGLE VECTOR FUNCTIONS            
negation                
    NEXY    BFC9    25     [X,Y] = [-X,-Y]
    NEG2    BFCC    33     negate 2d vector pointed to by X
    NEG3    BFCF    45     negate 3d vector pointed to by X
doubling                
    VDXY    BFD2    21     [X,Y] = 2[X,Y]
    VD2D    BFD5    37     double the 2d vector pointed to by X
    VD3D    BFD8    51     double the 3d vector pointed to by X
multiply vector by constant                
    VML2    BF93    (220)   multiply 2d vector (pointed to by X) by the value in Y
    VML3    BF96    (326)   multiply 3d vector (pointed to by X) by the value in Y
length of a vector                
    LEXY    BFC0    34     length of vector [X,Y] divided by sqrt(2)
    LEN2    BFC3    48     length of 2d vector pointed to by X divided by sqrt(2)
    LEN3    BFC6    63     length of 3d vector pointed to by X divided by sqrt(3)
trignometry                
    RTXY    BFBA    (212)   convert radius X angle Y to coordinates (X,Y)
    ATN2    BFBD    (185)   convert coordinates (X,Y) to an angle
normalization to length 1.0                
    NOXY    BFF3    (292)   normalize [X,Y] to unit length
    NRM2    BFF6    (340)   normalize 2d vector pointed to by X to unit length
    NRM3    BFF9    (495)   normalize 3d vector pointed to by X to unit length
                
                
TWO-VECTOR FUNCTIONS            
addition and subtraction                
    VAD2    BFDB    77     add two 2d vectors pointed to by X and Y (divided by 2)
    VAD3    BFDE    111     add two 3d vectors pointed to by X and Y (divided by 2)
    VSB2    BFE1    77     subtract two 2d vectors pointed to by X and Y (divided by 2)
    VSB3    BFE4    111     subtract two 3d vectors pointed to by X and Y (divided by 2)
dot product                
    DOT2    BFE7    (163)   dot product / 2 of two 2d vectors pointed to by X and Y
    DOT3    BFEA    (263)   dot product / 3 of two 3d vectors pointed to by X and Y
angle between two vectors                
    ANG2    BFED    (492)   angle between two 2d vectors pointed to by X and Y
    ANG3    BFF0    (619)   angle between two 3d vectors pointed to by X and Y
cross product                
    CROS    BFFC    (486)   cross product / 2 of two 3d vectors pointed to by X and Y


Fast Math in assembly language

Posted: Fri May 13, 2022 7:43 pm
by rje

Do you suppose this might be a candidate for one of the spare ROM banks?

 


Fast Math in assembly language

Posted: Fri May 13, 2022 7:48 pm
by Ed Minchau


On 5/13/2022 at 1:43 PM, rje said:




Do you suppose this might be a candidate for one of the spare ROM banks?



 



If so I would add about 4kb more in tables and code and the remaining 4kb would echo the Kernel redirects and JSRFAR etc from bank 4.


Fast Math in assembly language

Posted: Fri May 13, 2022 7:49 pm
by rje

It just seems too useful, and there seems to be room for it.  I could use it instead of using my own half-baked routines; it could shrink my codebase, which sometimes is a good thing.

 

 


Fast Math in assembly language

Posted: Fri May 13, 2022 7:57 pm
by Ed Minchau


On 5/13/2022 at 1:49 PM, rje said:




It just seems too useful, and there appears to be room.



 



Well, if a bunch of people actually do start using it, then maybe the team will consider it.  In the meantime I'll just keep posting more documentation below.  I need a few other people to try it anyway just in case I missed a bug somewhere.  

I made up the lookup tables on a spreadsheet which also generated the text file above, but I wrote all the code in the META/L editor, which isn't exactly conducive to sharing on GitHub.  I'm still in the process of revising the editor for rev40/41 (got sidetracked doing fastmath), but when I release that I'll also include the EDITMATH.PRG file I used to write the code, so people can poke around in it and prod it with tiny sticks.  Until then, I'll just be posting short snippets of code here as well, in the standard notation.  Maybe I'll include screenshots from the editor too.


Fast Math in assembly language

Posted: Sat May 14, 2022 3:51 am
by svenvandevelde

Ed,

This is very kind of you sharing these libraries. I'm sure that there are more people who have build such libraries over the past (30?) years. I went through your list and what really makes it interesting is to jointly discuss some of these algorithms to understand the logic applied.  Like SQRT    B200    square root of absolute value of index


Fast Math in assembly language

Posted: Sat May 14, 2022 4:58 am
by Ed Minchau

I can see I need to add titles to that list... done.

In that example, B200 is the address of the start of the square root table. The index into the table is the input, the value stored at (B200 plus the index) is the square root of the index.  If the index is negative,  the value stored at that address is the square root of the absolute value of the index.  i.e.:

Find square root of -1/2:

LDX #$C0;  minus one half

LDA SQRT,X; loads the value stored at B2C0

CPX #$80

This returns $5A (90/127 = 0.70866... close enough to 0.7071) and the carry set indicates i. 

 

Something similar can be done with all of those lookup tables.  All of them except FSIN and FCOS take one of these fractional data types as the index of a 256 byte LUT.  (FSIN and FCOS take an angle as an index.) Simply reading the table offset by X or Y gives the answer, in 4 cycles.  The subroutines use several of these tables at once to do more complex things, but the lookup tables are all available for the user too.


Fast Math in assembly language

Posted: Sat May 14, 2022 7:01 am
by Ed Minchau

Arithmetic functions: addition and subtraction

We're using fractions in the range of -1 to 1 as inputs to the addition and subtraction functions.  There's a problem with that: the range of the sum (or difference) of two numbers in the range [-1..1] is [-2..2].  There are two ways to handle this problem: either by truncating the result or by scaling.  Both methods are available.  ADDT and SUBT truncate the result on overflow, and FAVG and FHDF divide the result by two.

Note that for all of the functions listed, the tables used are also included.  That way if you only want a few of the functions, you can simply copy the appropriate tables out of the text file above and copy the code listed for each function.
function name  ADDT    
address        BF99    
time (mean)    (25) cycles
description    add values in X and Y, truncate on overflow    
inputs        X, Y    
registers used A  
tables used none
outputs        A, C, N, Z    
notes         will clip at $7F on overflow high or at $81 on overflow low;
carry set on overflow; Z set on result 00; N set if result negative

function name  SUBT
address        BF9C
time (mean)    (28) cycles
description    subtract X – Y, truncate on overflow
inputs         X, Y
registers used A
tables used none
outputs        A, C, N, Z
notes         will clip at $7F on overflow high or at $81 on overflow low;
carry set on overflow; Z set on result 00; N set if result negative

function name  FAVG
address        BF9F
time (mean)    19 cycles
description    X(64/127) + Y(63/127) {approximately X / 2 + Y / 2}
inputs         X, Y
registers used A
tables used HALF, ML3F
outputs        A
notes         average value of X and Y

function name  FHDF
address        BFA2
time (mean)    19 cycles
description    X(64/127) – Y(63/127) {approximately X / 2 – Y / 2}
inputs         X, Y
registers used A
tables used HALF, ML3F
outputs        A
notes         half the difference between X and Y

Here's what the code looks like in my editor:
note that rather than trying to parse the parameters to figure out what the command addressing mode is, 
the META/L editor has a unique 4 character code for each command to remove all ambiguity. 
The first three characters are the same as the standard notation, but the fourth is the addressing mode as follows:
A absolute address
X absolute address offset by X
Y absolute address offset by Y
# immediate
I implied (also used for Accumulator mode on those commands that also have Absolute addressing mode, like LSR)
Z zero page indexed by X (i.e. LDA 04,X is LDAZ 04)
/ zero page indirect
+ indexed indirect (i.e. LDA (04,X) is LDA+ 04)
- indirect indexed (i.e. LDA (04),Y is LDA- 04)
0 zero page
there are some exceptions that just made sense to me, such as CLCF for clear carry flag

arith1.png.b71c1fbafb7ffb35c08e0093b630ee95.png

FAVG and FHDF are so short that if you're trying to squeeze in as many calculations per second as possible, it's better to write them inline with your code rather than calling the subroutine; 6 cycles for JSR, 3 cycles for the JMP redirect, and 6 cycles for the RTS are more cycles than the calculation itself.  My editor uses a nonstandard notation, so in standard notation those are
LDA A100,X
CLC
ADC B500,Y

and
LDA A100,X
SEC
SBC B500,Y