Page 1 of 3
Fast Math in assembly language
Posted: Fri May 13, 2022 6:33 pm
by Ed Minchau
When programming in assembly language, quite often you will need to do calculations that would normally require floating point math. Those subroutines are available, the same ones that BASIC uses, but they are very slow. And if you're programming in assembly, you're doing so because it's much faster than BASIC, so using those routines can defeat the purpose, particularly if you need to do a lot of such calculations.
Ideally in assembly language you want to have most of your variables consisting of single bytes, not the four used for floating point. It's having to move all those bytes around and doing calculations involving all of them that makes the floating point subroutines slow. And if you're going for speed, you can make a tradeoff: increased speed at the expense of accuracy. A function involving a single byte can just be a lookup table.
Over the last several years I have developed a number of lookup tables and subroutines that enable some very fast math on the 6502 and now 65c02. Inspired by
@svenvandevelde's question about ATAN2, I decided to compile many of them into a single 8kb file that can be loaded into banked RAM. I have other such routines and lookup tables, but I decided to just include those that would most likely be useful to the broadest range of applications. These lookup tables and functions will work on all revisions of the X16 emulator, and indeed will work on any 65c02 system as long as the file is loaded at $A000. The functions are all called through JMP redirects in page BF, and those JMP table locations will not change if I do revisions on the code. I'm pretty sure I killed all the bugs, but some might have slipped through.
I'll be using further posts in this thread as sort of a user guide, but attached to this post is the file FASTMATH.BIN (the lookup tables and functions) and a text file containing the BASIC program that made all of the lookup tables (but don't copy and paste it into the emulator or it will overwrite FASTMATH.BIN). When using FASTMATH.BIN, you must use a relocated load to put it into banked RAM, otherwise it will load into memory at $7000 and won't work. Or you could load it into its default location at $7000 and just copy it up into the RAM bank of your choice.
This software is released under an MIT license, meaning you can use it freely for any purpose, commercial or non-commercial, as long as you leave the MIT license in there (it's located at $BF7F). And please, these functions are all just approximations accurate to about 1%, so don't use the software for controlling a nuclear power plant or medical equipment.
maketables.txt FASTMATH.BIN
Fast Math in assembly language
Posted: Fri May 13, 2022 6:52 pm
by Ed Minchau
A single byte can represent either an integer in the range 0-255 or a signed integer in the range -128 to 127; the difference is in how we think of what bit 7 represents. If bit 7 is equivalent to 128, it's an unsigned integer, and if it's equivalent to -128 it's a signed integer. The signed integer representation is sometimes called Two's Complement notation.
In two's complement notation, a number can be negated simply by only two lines of code:
EOR #$FF
INC A
So -127 is just 7F EOR FF plus 1 = $81, and -1 is $FF.
FASTMATH uses this two's complement notation, with a couple of twists. First, $80 is considered an invalid number. Only the range from -127 to +127 is used. These numbers are considered the numerator of a fraction, with 127 as the denominator, so -127..127 represents the range [-1.0 .. +1.0]. It isn't quite a fixed point notation; it's a fractional notation. Only the numbers in the range [-1.0 .. 1.0] have any meaning to FASTMATH functions.
The exception to this is trigonometric functions; an angle is not treated the same way as other numbers. For angles, we divide the circle up into 256 "binary degrees" or "bigrees". So for angles, $80 bigrees = pi radians = 180 degrees. So to add 180 degrees to an angle, just EOR #$80.
There are 26 lookup tables in FASTMATH that are potentially useful. For each of these, you can put the input in the X (or Y) register and just read the output: LDA HALF,X will return half the value of X in the accumulator, for instance.
All of these lookup tables are listed as DATA statements in the text file posted above, with REM statements describing each one, so if you just want to use a few of these tables in your own code you can just extract what you need.
Be sure to set the RAM bank pointer to whatever bank holds FASTMATH when using these tables.
name address description
MBY2 A000 multiply by 2, truncated at 7F and 81
HALF A100 multiply by 64/127
MLT5 A200 multiply by 32/127
MLT4 A300 multiply by 16/127
MLT3 A400 multiply by 8/127
MLT2 A500 multiply by 4/127
MLT1 A600 multiply by 2/127
MLT0 A700 multiply by 1/127
FABS A800 absolute value
DIV6 A900 127*64/index; 00 if absolute value of index is less than 64; $80 if index is zero
DIV5 AA00 127*32/index; 00 if absolute value of index is less than 32; $80 if index is zero
DIV4 AB00 127*16/index; 00 if absolute value of index is less than 16; $80 if index is zero
DIV3 AC00 127*8/index; 00 if absolute value of index is less than 8; $80 if index is zero
DIV2 AD00 127*4/index; 00 if absolute value of index is less than 4; $80 if index is zero
DIV1 AE00 127*2/index; 00 if absolute value of index is 1; $80 if index is zero
DIV0 AF00 127/index; $80 if index is zero
FSIN B000 sine of index
FCOS B100 cosine of index
SQRT B200 square root of absolute value of index
SQAR B300 square of index
SQA3 B400 2/3 of square of index
ML3F B500 multiply by 63/127
M2/3 B600 multiply by 2/3
M1/3 B700 multiply by 1/3
FACO B800 arccosine
FLAT B900 arctangent of slopes from -1 to 1
The final two lookup tables aren't directly useful to the user; they're just included here for completeness.
NOR2 BA00 used for 2d normalization
NOR3 BA80 used for 3d normalization
Fast Math in assembly language
Posted: Fri May 13, 2022 7:05 pm
by Ed Minchau
There are 36 functions included in FASTMATH. I'll just list them here for now, and will go into more detail (and show the code) in subsequent posts.
These functions are all given a 4-character label, but you can label them whatever you want in your own code. The JMP redirect adds three precious cycles to each subroutine, but it's worth it to keep your code working even if I revise this file later. The time in cycles (including the JMP redirect and RTS, but not your JSR call) for each subroutine is also listed ; variable time subroutines are listed with a mean time in brackets. Some of those are so short that the call and redirect is a significant fraction of the total time, so for those you may want to simply copy them inline in your code; the code for the really short ones will be listed in subsequent posts when I go into more detail.
It is assumed that if you're using these functions you're not also using BASIC, so these functions use zero page locations F0 to FF as work space.
Be sure to set the RAM bank pointer to whatever bank holds FASTMATH when using these functions.
name addr time description
ARITHMETIC (SCALAR) FUNCTIONS
ADDT BF99 (25) add values in X and Y, truncate on overflow
SUBT BF9C (28) subtract X – Y, truncate on overflow
FAVG BF9F 19 X(64/127) + Y(63/127) {approximately X / 2 + Y / 2}
FHDF BFA2 19 X(64/127) – Y(63/127) {approximately X / 2 – Y / 2}
FCMP BFA5 23 compare X to Y
FMIN BFA8 (39) minimum of X and Y
FMAX BFAB (39) maximum of X and Y
FMLT BFAE 47 multiply X by Y, faster version
MULT BFB1 (87) multiply X by Y, more accurate version
FDIV BFB4 (108) divide X by Y
FDV2 BFB7 (88) divide X by Y, no error checking
SINGLE VECTOR FUNCTIONS
negation
NEXY BFC9 25 [X,Y] = [-X,-Y]
NEG2 BFCC 33 negate 2d vector pointed to by X
NEG3 BFCF 45 negate 3d vector pointed to by X
doubling
VDXY BFD2 21 [X,Y] = 2[X,Y]
VD2D BFD5 37 double the 2d vector pointed to by X
VD3D BFD8 51 double the 3d vector pointed to by X
multiply vector by constant
VML2 BF93 (220) multiply 2d vector (pointed to by X) by the value in Y
VML3 BF96 (326) multiply 3d vector (pointed to by X) by the value in Y
length of a vector
LEXY BFC0 34 length of vector [X,Y] divided by sqrt(2)
LEN2 BFC3 48 length of 2d vector pointed to by X divided by sqrt(2)
LEN3 BFC6 63 length of 3d vector pointed to by X divided by sqrt(3)
trignometry
RTXY BFBA (212) convert radius X angle Y to coordinates (X,Y)
ATN2 BFBD (185) convert coordinates (X,Y) to an angle
normalization to length 1.0
NOXY BFF3 (292) normalize [X,Y] to unit length
NRM2 BFF6 (340) normalize 2d vector pointed to by X to unit length
NRM3 BFF9 (495) normalize 3d vector pointed to by X to unit length
TWO-VECTOR FUNCTIONS
addition and subtraction
VAD2 BFDB 77 add two 2d vectors pointed to by X and Y (divided by 2)
VAD3 BFDE 111 add two 3d vectors pointed to by X and Y (divided by 2)
VSB2 BFE1 77 subtract two 2d vectors pointed to by X and Y (divided by 2)
VSB3 BFE4 111 subtract two 3d vectors pointed to by X and Y (divided by 2)
dot product
DOT2 BFE7 (163) dot product / 2 of two 2d vectors pointed to by X and Y
DOT3 BFEA (263) dot product / 3 of two 3d vectors pointed to by X and Y
angle between two vectors
ANG2 BFED (492) angle between two 2d vectors pointed to by X and Y
ANG3 BFF0 (619) angle between two 3d vectors pointed to by X and Y
cross product
CROS BFFC (486) cross product / 2 of two 3d vectors pointed to by X and Y
Fast Math in assembly language
Posted: Fri May 13, 2022 7:43 pm
by rje
Do you suppose this might be a candidate for one of the spare ROM banks?
Fast Math in assembly language
Posted: Fri May 13, 2022 7:48 pm
by Ed Minchau
On 5/13/2022 at 1:43 PM, rje said:
Do you suppose this might be a candidate for one of the spare ROM banks?
If so I would add about 4kb more in tables and code and the remaining 4kb would echo the Kernel redirects and JSRFAR etc from bank 4.
Fast Math in assembly language
Posted: Fri May 13, 2022 7:49 pm
by rje
It just seems too useful, and there seems to be room for it. I could use it instead of using my own half-baked routines; it could shrink my codebase, which sometimes is a good thing.
Fast Math in assembly language
Posted: Fri May 13, 2022 7:57 pm
by Ed Minchau
On 5/13/2022 at 1:49 PM, rje said:
It just seems too useful, and there appears to be room.
Well, if a bunch of people actually do start using it, then maybe the team will consider it. In the meantime I'll just keep posting more documentation below. I need a few other people to try it anyway just in case I missed a bug somewhere.
I made up the lookup tables on a spreadsheet which also generated the text file above, but I wrote all the code in the META/L editor, which isn't exactly conducive to sharing on GitHub. I'm still in the process of revising the editor for rev40/41 (got sidetracked doing fastmath), but when I release that I'll also include the EDITMATH.PRG file I used to write the code, so people can poke around in it and prod it with tiny sticks. Until then, I'll just be posting short snippets of code here as well, in the standard notation. Maybe I'll include screenshots from the editor too.
Fast Math in assembly language
Posted: Sat May 14, 2022 3:51 am
by svenvandevelde
Ed,
This is very kind of you sharing these libraries. I'm sure that there are more people who have build such libraries over the past (30?) years. I went through your list and what really makes it interesting is to jointly discuss some of these algorithms to understand the logic applied. Like SQRT B200 square root of absolute value of index
Fast Math in assembly language
Posted: Sat May 14, 2022 4:58 am
by Ed Minchau
I can see I need to add titles to that list... done.
In that example, B200 is the address of the start of the square root table. The index into the table is the input, the value stored at (B200 plus the index) is the square root of the index. If the index is negative, the value stored at that address is the square root of the absolute value of the index. i.e.:
Find square root of -1/2:
LDX #$C0; minus one half
LDA SQRT,X; loads the value stored at B2C0
CPX #$80
This returns $5A (90/127 = 0.70866... close enough to 0.7071) and the carry set indicates i.
Something similar can be done with all of those lookup tables. All of them except FSIN and FCOS take one of these fractional data types as the index of a 256 byte LUT. (FSIN and FCOS take an angle as an index.) Simply reading the table offset by X or Y gives the answer, in 4 cycles. The subroutines use several of these tables at once to do more complex things, but the lookup tables are all available for the user too.
Fast Math in assembly language
Posted: Sat May 14, 2022 7:01 am
by Ed Minchau
Arithmetic functions: addition and subtraction
We're using fractions in the range of -1 to 1 as inputs to the addition and subtraction functions. There's a problem with that: the range of the sum (or difference) of two numbers in the range [-1..1] is [-2..2]. There are two ways to handle this problem: either by truncating the result or by scaling. Both methods are available. ADDT and SUBT truncate the result on overflow, and FAVG and FHDF divide the result by two.
Note that for all of the functions listed, the tables used are also included. That way if you only want a few of the functions, you can simply copy the appropriate tables out of the text file above and copy the code listed for each function.
function name ADDT
address BF99
time (mean) (25) cycles
description add values in X and Y, truncate on overflow
inputs X, Y
registers used A
tables used none
outputs A, C, N, Z
notes will clip at $7F on overflow high or at $81 on overflow low;
carry set on overflow; Z set on result 00; N set if result negative
function name SUBT
address BF9C
time (mean) (28) cycles
description subtract X – Y, truncate on overflow
inputs X, Y
registers used A
tables used none
outputs A, C, N, Z
notes will clip at $7F on overflow high or at $81 on overflow low;
carry set on overflow; Z set on result 00; N set if result negative
function name FAVG
address BF9F
time (mean) 19 cycles
description X(64/127) + Y(63/127) {approximately X / 2 + Y / 2}
inputs X, Y
registers used A
tables used HALF, ML3F
outputs A
notes average value of X and Y
function name FHDF
address BFA2
time (mean) 19 cycles
description X(64/127) – Y(63/127) {approximately X / 2 – Y / 2}
inputs X, Y
registers used A
tables used HALF, ML3F
outputs A
notes half the difference between X and Y
Here's what the code looks like in my editor:
note that rather than trying to parse the parameters to figure out what the command addressing mode is,
the META/L editor has a unique 4 character code for each command to remove all ambiguity.
The first three characters are the same as the standard notation, but the fourth is the addressing mode as follows:
A absolute address
X absolute address offset by X
Y absolute address offset by Y
# immediate
I implied (also used for Accumulator mode on those commands that also have Absolute addressing mode, like LSR)
Z zero page indexed by X (i.e. LDA 04,X is LDAZ 04)
/ zero page indirect
+ indexed indirect (i.e. LDA (04,X) is LDA+ 04)
- indirect indexed (i.e. LDA (04),Y is LDA- 04)
0 zero page
there are some exceptions that just made sense to me, such as CLCF for clear carry flag
![arith1.png.b71c1fbafb7ffb35c08e0093b630ee95.png](<___base_url___>/applications/core/interface/js/spacer.png)
FAVG and FHDF are so short that if you're trying to squeeze in as many calculations per second as possible, it's better to write them inline with your code rather than calling the subroutine; 6 cycles for JSR, 3 cycles for the JMP redirect, and 6 cycles for the RTS are more cycles than the calculation itself. My editor uses a nonstandard notation, so in standard notation those are
LDA A100,X
CLC
ADC B500,Y
and
LDA A100,X
SEC
SBC B500,Y