On 12/23/2022 at 9:59 PM, Michael Kaiser said:
I'm interested in the performance to decide if it's worth writing faster math functions or just using it.
All day long: It's better and faster to just write faster math functions. All day and in every way.
But wait! Now I'll completely contradict myself! The reason for this isn't because the math functions are inefficient or slow, it's just that they're generic. As such, it's also a waste of time to try to write a "better math library..." as it will have the same problem of being generic and slow.
What you do is write the math out as much as possible in the first place. Here's an example: Say you have a bunch of on-screen things that are each located by an x,y pixel offset from a base address, as you would have with, say, an X16 app using bitmap graphics to draw stuff. The math is trivial when expressed in C:
EffectiveAddr = BaseAddr + (X + (Y * 320))
If the base address is, say, 0x00000 (the very start of VRAM) and the mode is 320x240, you would be using that calculation everywhere. If you were coding in C, you might just express it as that simple equation all over instead of trying to optimize it. That's a trap of using C: it's so easy to just slap down crazy complicated math that you never stop to think if you could just eliminate the math in the first place.
But wait. What if you never change the mode, and as such never change the number of pixels on a row, and also put them in the same x and y location? Well heck, you need almost no math routines at all then. You can just let the compiler do it with:
offset_address = 168+(204*320)
That just creates a constant that represents a 24-bit number: the calculated address offset for that feature. In code, you can then use three macros to set the base address, add the offset, then apply that to the VERA address registers along with the stride. Like this:
mem_SET_IMM_24 VRAM_bitmap, ZP24_R0
math_ADD_IMM_24 offset_address, ZP24_R0
vera_SET_VRAM_ADDR ZP24_R0, 0, $10 ;Addr0, stride 1
As you can see, I'm a big fan of macros. And I'm a fan of simply NOT DOING math I can just avoid completely. The only math in there is a 24-bit add of the offset into the zero page 24-bit temp storage. And if I wanted to... I could have just rolled the offset into the mem_SET macro by adding it to the IMM param: (VRAM_bitmap+(168+(204*320)))
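For anyone who thinks in C rather than ca65, here is a minimal sketch of the same idea. It assumes one byte per pixel (which the formula above implies) and reuses the 168,204 position from the example. The names ROW_BYTES, FEATURE_ADDR and effective_addr are made up for illustration and are not from any X16 header.

```c
#include <stdint.h>

/* Hypothetical names; the numbers come from the example above. */
#define VRAM_BITMAP 0x00000ul   /* base address: the very start of VRAM */
#define ROW_BYTES   320ul       /* bytes per row, assuming 1 byte per pixel */
#define FEATURE_X   168ul
#define FEATURE_Y   204ul

/* Folded by the compiler into a single constant: no multiply at run time. */
#define FEATURE_ADDR (VRAM_BITMAP + (FEATURE_X + (FEATURE_Y * ROW_BYTES)))

/* The generic run-time version, which pays for a multiply on every call. */
uint32_t effective_addr(uint32_t base, uint16_t x, uint16_t y)
{
    return base + (x + ((uint32_t)y * ROW_BYTES));
}
```

Both forms give the same address for a fixed position. The difference is only whether the target machine or the build machine does the multiply.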
So yeah... make ca65 do the math, or store a precalculated lookup table in a file and just load it. Make a tool in C that generates the table and is never even IN the X16 app.
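A rough sketch of what such a table tool could look like, run on the PC at build time. The layout is an assumption chosen for illustration: one 3-byte little-endian entry per row holding Y * 320 for a 240-row screen. The function names and the idea of writing a raw binary file are hypothetical, not something the X16 toolchain prescribes.

```c
#include <stdint.h>
#include <stdio.h>

#define ROW_BYTES 320ul   /* bytes per row, assuming 1 byte per pixel */
#define ROWS      240u    /* rows in the 320x240 mode */

/* Fill buf with one 24-bit entry per row: low byte, middle byte, high byte. */
void build_row_table(uint8_t buf[ROWS * 3])
{
    unsigned y;
    for (y = 0; y < ROWS; ++y) {
        uint32_t off = y * ROW_BYTES;          /* the multiply happens here, on the PC */
        buf[y * 3 + 0] = (uint8_t)(off & 0xFF);
        buf[y * 3 + 1] = (uint8_t)((off >> 8) & 0xFF);
        buf[y * 3 + 2] = (uint8_t)((off >> 16) & 0xFF);
    }
}

/* Write the table to a binary file; returns 0 on success, -1 on failure. */
int write_row_table(const char *path)
{
    uint8_t buf[ROWS * 3];
    size_t written;
    FILE *f;

    build_row_table(buf);
    f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    written = fwrite(buf, 1, sizeof buf, f);
    if (fclose(f) != 0)
        return -1;
    return written == sizeof buf ? 0 : -1;
}
```

The app would then load the 720-byte file and index it by Y * 3, so the only run-time work left is a table read plus adding X.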
But when there's no way around it? Then you use a math routine, written in assembler, tuned to the exact data sizes you need. You can START with generic routines to get it to work, then as you refactor the code or see that it needs optimization, "eliminate the math" to simplify the routine.
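As one illustration of "tuned to the exact data sizes", here is a sketch in C of the generic row-offset multiply next to a version specialized for a fixed 320-byte row and a Y that fits in 8 bits (0 to 239). It leans on 320 = 256 + 64, so the multiply becomes two shifts and an add. That maps naturally onto 6502 code, though the function names are hypothetical and the real routine would be written in assembler.

```c
#include <stdint.h>

/* Generic: works for any row width, but needs a full 16x16 multiply. */
uint32_t row_offset_generic(uint16_t y, uint16_t row_bytes)
{
    return (uint32_t)y * row_bytes;
}

/* Specialized: row width fixed at 320, y known to fit in one byte.
 * y * 320 == y * 256 + y * 64 == (y << 8) + (y << 6). */
uint32_t row_offset_320(uint8_t y)
{
    return ((uint32_t)y << 8) + ((uint32_t)y << 6);
}
```

Starting with the generic version and swapping in the specialized one later follows the order described above: get it working first, then strip out the math you don't need.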