Fast Math in assembly language

Ed Minchau · Post by **Ed Minchau** » Sat May 14, 2022 7:25 am

Arithmetic operations: compare, minimum, maximum

Since these numbers are in two's complement notation, simple CMP commands won't properly compare the numbers; negative numbers appear bigger than positive to CMP. FCMP produces the correct Z, C, and N flags. If the invalid input of $80 is used, it is considered negative infinity.

function name  FCMP
address        BFA5
time (mean)    23 cycles
description    compare X to Y
inputs         X, Y
registers used A
tables used    none
outputs        Z,N,C
notes          Z set if X = Y, C set if X >= Y, N set if X<Y

The minimum and maximum of two numbers are useful for things like a chess program's minimax algorithm. The maximum is also equivalent to OR in fuzzy logic.

function name  FMIN
address        BFA8
time (mean)    (39) cycles
description    minimum of X and Y
inputs         X, Y
registers used A
tables used    none
outputs        A
notes          returns the minimum

function name  FMAX
address        BFAB
time (mean)    (39) cycles
description    maximum of X and Y
inputs         X, Y
registers used A
tables used    none
outputs        A
notes          returns the maximum

In the META/L editor:

Ed Minchau · Post by **Ed Minchau** » Sat May 14, 2022 8:08 am

arithmetic operations: multiplication and division

There are two different multiplication subroutines. Both of them are fairly accurate. FMLT is about twice as fast as MULT, but it is slightly less accurate and it changes the values in X and Y. MULT isn't 100% accurate but all results are within $01 of the correct value, and MULT preserves the values in X and Y.

function name  FMLT
address        BFAE
time (mean)    47 cycles
description    multiply X by Y, faster version
inputs         X, Y
registers used A, X, Y
outputs        A
tables used    HALF, ML3F
notes          error standard deviation = 0.6757

function name  MULT
address        BFB1
time (mean)    (87) cycles
description    multiply X by Y, more accurate version
inputs         X, Y
registers used A
tables used    FABS,HALF, MLT5, MLT4, MLT3, MLT2, MLT1, MLT0
outputs        A
notes          error standard deviation = 0.4301

The code for FMLT is displayed below. It uses inline versions of FAVG and FHDF, taking advantage of the identity

(x/2 + y/2)^2 - (x/2 - y/2)^2

= x^2/4 + 2xy/4 + y^2/4 - x^2/4 + 2xy/4 - y^2/4

= 4xy/4

= xy

Here's the code for MULT:

MULT just adds fractions if the corresponding bit in the absolute value of X is equal to 1. The BBR commands just skip that addition if the bit in absolute value of X is zero. That absolute value is taken at the beginning if X is negative, and then bit 7 in the temporary zero page variable is set. At the end of the calculation, if bit 7 is set the result is negated. The only command not shown on this image is the RTS.

There are two entries to the division subroutine. FDIV is the first, and does the error checking on the inputs. FDV2 is a little farther down and skips that error checking step.

function name  FDIV
address        BFB4
time (mean)    (108) cycles
description    divide X by Y
inputs         X, Y
registers used A
tables used    FABS, DIV6, DIV5, DIV4, DIV3, DIV2, DIV1, DIV0
outputs        A
notes          error standard deviation = 0.4726, returns carry set if Y=$00 or Y=$80 or X=$80 or abs(X)>abs(Y)

function name  FDV2
address        BFB7
time (mean)    (88) cycles
description    divide X by Y, no error checking
inputs         X, Y
registers used A
tables used    FABS, DIV6, DIV5, DIV4, DIV3, DIV2, DIV1, DIV0
outputs        A
notes          error standard deviation = 0.4726

Here's the code for FDIV in the META/L editor:

The multiplication and division functions are pretty important to the rest of the subroutines, so I wanted to make certain that they were actually close to correct. A nice thing about having two-parameter functions that can each only take 256 possible values is that it's easy to calculate all possible results for all possible combinations of inputs on a spreadsheet, and compare the results to an ideal target. The targets were all done in floating point, and only converted into single byte format at the last step. I was then able to compare the calculations of these subroutines against the target, and make histograms of the differences between calculation and target.

There is another division method using logarithms, but it's only about 20 cycles faster than FDIV and the accuracy is pretty bad, so I didn't include the log/antilog tables in the file:

Ed Minchau · Post by **Ed Minchau** » Sun May 15, 2022 5:03 pm

Vectors are objects with two or more numerical parameters. These can be passed to the vector subroutines in one of three ways:

- the vector X and Y coordinates can be stored in the X and Y registers

- the vector coordinates can be stored in two consecutive locations in zero page, the first of which is pointed to by X or Y

- the vector coordinates can be stored in three consecutive locations in zero page, the first of which is pointed to by X or Y

Single Vector operations: multiplication by -1

function name  NEXY
address        BFC9
time (mean)    25 cycles
description    [X,Y] = [-X,-Y]
inputs         X, Y
registers used A, X, Y
tables used    none
outputs        X, Y
notes          negates the vector in [X,Y]

function name  NEG2
address        BFCC
time (mean)    33 cycles
description    negate 2d vector pointed to by X
inputs         X
registers used A
tables used    none
outputs        zero page
notes          the input vector in zero page pointed to by X is also the result vector

function name  NEG3
address        BFCF
time (mean)    45 cycles
description    negate 3d vector pointed to by X
inputs         X
registers used A
tables used    none
outputs        zero page
notes          the input vector in zero page pointed to by X is also the result vector

Single Vector operations: multiplication by 2

note that these operations do not check for overflow, they simply look up a doubling amount in the MBY2 table, which clips overflow at 7F or 81

function name  VDXY
address        BFD2
time (mean)    21 cycles
description    [X,Y] = 2[X,Y]
inputs         X, Y
registers used A, X, Y
tables used    MBY2
outputs        X, Y
notes          will clip at $7F on overflow high or at $81 on overflow low

function name  VD2D
address        BFD5
time (mean)    37 cycles
description    double the 2d vector pointed to by X
inputs         X
registers used A, Y
tables used    MBY2
outputs        zero page
notes          the input vector in zero page pointed to by X is also the result vector;
               will clip at $7F on overflow high or at $81 on overflow low

function name  VD3D
address        BFD8
time (mean)    51 cycles
description    double the 3d vector pointed to by X
inputs         X
registers used A, Y
tables used    MBY2
outputs        zero page
notes          the input vector in zero page pointed to by X is also the result vector;
               will clip at $7F on overflow high or at $81 on overflow low

Single Vector operations: multiply by a constant

function name  VML2
address        BF93
time (mean)    (220) cycles
description    multiply 2d vector (pointed to by X) by the value in Y
inputs         X, Y
registers used A
tables used    FABS,HALF, MLT5, MLT4, MLT3, MLT2, MLT1, MLT0
outputs        zero page
notes          the input vector in zero page pointed to by X is also the result vector

function name  VML3
address        BF96
time (mean)    (326) cycles
description    multiply 3d vector (pointed to by X) by the value in Y
inputs         X, Y
registers used A
tables used    FABS,HALF, MLT5, MLT4, MLT3, MLT2, MLT1, MLT0
outputs        zero page
notes          the input vector in zero page pointed to by X is also the result vector

Ed Minchau · Post by **Ed Minchau** » Sun May 15, 2022 5:13 pm

Vector Length

function name  LEXY
address        BFC0
time (mean)    34 cycles
description    length of vector [X,Y] divided by sqrt(2)
inputs         X, Y
registers used A
tables used    SQAR, SQRT
outputs        A
notes          length value is divided by sqrt(2) to keep it in the range [0..1]

function name  LEN2
address        BFC3
time (mean)    48 cycles
description    length of 2d vector pointed to by X divided by sqrt(2)
inputs         X
registers used A, Y
tables used    SQAR, SQRT
outputs        A
notes          length value is divided by sqrt(2) to keep it in the range [0..1]

function name  LEN3
address        BFC6
time (mean)    63 cycles
description    length of 3d vector pointed to by X divided by sqrt(3)
inputs         X
registers used A, Y
tables used    SQA3, SQRT
outputs        A
notes          length value is divided by sqrt(3) to keep it in the range [0..1]

Ed Minchau · Post by **Ed Minchau** » Sun May 15, 2022 5:20 pm

Single vector trigonometric operations:

function name  RTXY
address        BFBA
time (mean)    (212) cycles
description    convert radius X angle Y to coordinates (X,Y)
inputs         X, Y
registers used A, X, Y
tables used    FABS,HALF, MLT5, MLT4, MLT3, MLT2, MLT1, MLT0, FSIN, FCOS
outputs        X, Y
notes          if the radius is negative, then the absolute value of the radius is used and the angle changes by $80;
               radius of 00 or $80 returns [0,0]

function name  ATN2
address        BFBD
time (mean)    (185) cycles
description    convert (X,Y) to an angle
inputs         X, Y
registers used A, X, Y
tables used    FABS, DIV6, DIV5, DIV4, DIV3, DIV2, DIV1, DIV0, FLAT
outputs        A
notes          error standard deviation = 0.2881

I did a complete analysis of the ATN2 function, and it turns out it's the most accurate of all the functions; there's only one division in the operation, and when it uses the FLAT table, which returns the arctangent of the slope, the minor errors in division mostly don't matter:

Ed Minchau · Post by **Ed Minchau** » Sun May 15, 2022 6:05 pm

Single vector: Normalization

These subroutines are used if you want to convert a 2d or 3d vector into a vector with the same direction but unit length.

function name  NOXY
address        BFF3
time (mean)    (292) cycles
description    normalize [X,Y] to unit length
inputs         X, Y
registers used A, X, Y
tables used    SQAR, MBY2, FABS,HALF, MLT5, MLT4, MLT3, MLT2, MLT1, MLT0, NOR2
outputs        X, Y, C
notes          carry set if input vector is [0,0]

function name  NRM2
address        BFF6
time (mean)    (340) cycles
description    normalize 2d vector pointed to by X to unit length
inputs         X
registers used A, Y
tables used    SQAR, MBY2, FABS,HALF, MLT5, MLT4, MLT3, MLT2, MLT1, MLT0, NOR2
outputs        zero page, C
notes          the input vector in zero page pointed to by X is also the result vector; carry set if input vector is [0,0]

function name  NRM3
address        BFF9
time (mean)    (495) cycles
description    normalize 3d vector pointed to by X to unit length
inputs         X
registers used A
tables used    SQA3, MBY2, FABS,HALF, MLT5, MLT4, MLT3, MLT2, MLT1, MLT0, NOR3
outputs        zero page, C
notes          the input vector in zero page pointed to by X is also the result vector; carry set if input vector is [0,0,0]

Ed Minchau · Post by **Ed Minchau** » Sun May 15, 2022 6:56 pm

Two-vector operations: addition and subtraction

These functions add or subtract the individual components of two vectors. Since this addition or subtraction could yield a value in the range [-2..2], the result is half the value of the sum or difference, to keep each component in the [-1..1] range.

function name  VAD2
address        BFDB
time (mean)    77 cycles
description    add two 2d vectors pointed to by X and Y
inputs         X, Y
registers used A
tables used    HALF, ML3F
outputs        zero page
notes          result in [FD,FE] is half the length of the actual result to keep all components in the range [-1..1]

function name  VAD3
address        BFDE
time (mean)    111 cycles
description    add two 3d vectors pointed to by X and Y
inputs         X, Y
registers used A
tables used    HALF, ML3F
outputs        zero page
notes          result in [FD,FE,FF] is half the length of the actual result to keep all components in the range [-1..1]

function name  VSB2
address        BFE1
time (mean)    77 cycles
description    subtract two 2d vectors pointed to by X and Y
inputs         X, Y
registers used A
tables used    HALF, ML3F
outputs        zero page
notes          result in [FD,FE] is half the length of the actual result to keep all components in the range [-1..1]

function name  VSB3
address        BFE4
time (mean)    111 cycles
description    subtract two 3d vectors pointed to by X and Y
inputs         X, Y
registers used A
tables used    HALF, ML3F
outputs        zero page
notes          result in [FD,FE,FF] is half the length of the actual result to keep all components in the range [-1..1]

Ed Minchau · Post by **Ed Minchau** » Sun May 15, 2022 7:16 pm

Two-vector operations: dot product and angle between two vectors

The vector dot product is an intermediate step on the way to finding the angle between two vectors. There's a 2d and a 3d version.

function name  DOT2
address        BFE7
time (mean)    (163) cycles
description    dot product / 2 of two 2d vectors pointed to by X and Y
inputs         X, Y
registers used A, X, Y
tables used    HALF, ML3F
outputs        A
notes          result is half the dot product to keep it in the range [-1..1]

function name  DOT3
address        BFEA
time (mean)    (263) cycles
description    dot product / 3 of two 3d vectors pointed to by X and Y
inputs         X, Y
registers used A, X, Y
tables used    HALF, ML3F, M2/3, M1/3
outputs        A
notes          result is one-third the dot product to keep it in the range [-1..1]

The angle between two vectors is given by the equation

theta = arccosine ((VdotW)/((LenV)(LenW)))

If the carry is clear when calling this function the result will be as above. If the carry is set, the result will be the negative of theta.

function name  ANG2
address        BFED
time (mean)    (492) cycles
description    angle between two 2d vectors pointed to by X and Y
inputs         X, Y, C
registers used A, X, Y
tables used    SQAR, SQRT, FABS, DIV6, DIV5, DIV4, DIV3, DIV2, DIV1, DIV0, HALF, ML3F
outputs        A
notes          calculates theta=arccos((X dot Y)/((len X)(len Y))); if carry is set on input, the output angle will be negative

function name  ANG3
address        BFF0
time (mean)    (619) cycles
description    angle between two 3d vectors pointed to by X and Y
inputs         X, Y, C
registers used A, X, Y
tables used    SQA3, SQRT, FABS, DIV6, DIV5, DIV4, DIV3, DIV2, DIV1, DIV0, HALF, ML3F, M2/3, M1/3
outputs        A
notes          calculates theta=arccos((X dot Y)/((len X)(len Y))); if carry is set on input, the output angle will be negative

Note that the dot product is divided by 2 in 2d and divided by 3 in 3d; similarly the lengths are divided by sqrt(2) in 2d and by sqrt(3) in 3d; when combined in the ANG2 and ANG3 functions, these scaling factors all cancel out.

Ed Minchau · Post by **Ed Minchau** » Sun May 15, 2022 7:27 pm

Vector Cross Product

The cross product is the vector perpendicular to the plane described by two vectors; the cross product is a vector that is perpendicular to both of the given vectors. This only works in 3d. This is very useful for things like light reflection when ray tracing. The resulting vector is half of the actual cross product, to prevent results outside the [-1..1] range.

The cross product result vector will be stored in zero page at FD, FE, and FF.

function name  CROS
address        BFFC
time (mean)    (486) cycles
description    cross product / 2 of two 3d vectors pointed to by X and Y
inputs         X, Y
registers used A, X, Y
tables used    HALF, ML3F
outputs        zero page
notes          result vector in [FD,FE,FF] is half the length of the actual cross product to keep all components in the range [-1..1]

BruceMcF · Post by **BruceMcF** » Mon May 16, 2022 11:11 pm

On 5/13/2022 at 3:48 PM, Ed Minchau said:

If so I would add about 4kb more in tables and code and the remaining 4kb would echo the Kernel redirects and JSRFAR etc from bank 4.

This is exactly the kind of ROM home I would like to see for a Sweet16 VM.