Page 2 of 5

Does this seem eggregious to anyone besides me?

Posted: Sat Nov 20, 2021 8:28 pm
by SlithyMatt

My 16.8 fixed point library will do division by whole numbers (the fraction is lopped off, as I didn't need that, but it looks like you don't, either, if you are just dividing by 60): https://github.com/SlithyMatt/multi-mandlebrot/blob/main/6502/fixedpt24.asm


Does this seem eggregious to anyone besides me?

Posted: Sat Nov 20, 2021 8:34 pm
by BruceMcF


On 11/20/2021 at 3:03 PM, ZeroByte said:




@BruceMcFthe subtraction test says lda N3 .. sta N3.  Is that supposed to be N+3 ?



So n1-3 holds a 16.8? The .8 part isn't a modulo remainder is it?



N be should holding the ".8" (starting as 0, assuming a 16bit dividend), N+1:N+2 holding the 16bit whole part of the dividend, ... rotate it into N+3, trial subtract, inverted carry as borrow in the 6502 family means when the trial subtraction generates a borrow, that is a carry clear, a 0 bit in the result, when the trial subtraction does not generate a borrow, that is carry set, a 1 bit in the result. Save the trial subtraction result when no borrow is generated, so N+3 holds the ongoing remainder through to the end of the process, but the remainder doesn't much matter if you have a 16.8 result.

The "CLC" at the start puts a zero bit in the low bit of the dividend | result, which should be shifted into the carry flag by the 25th iteration of the rotate of the dividend|result location.


Does this seem eggregious to anyone besides me?

Posted: Sat Nov 20, 2021 8:38 pm
by BruceMcF


On 11/20/2021 at 3:28 PM, SlithyMatt said:




My 16.8 fixed point library will do division by whole numbers (the fraction is lopped off, as I didn't need that, but it looks like you don't, either, if you are just dividing by 60): https://github.com/SlithyMatt/multi-mandlebrot/blob/main/6502/fixedpt24.asm



Yes, the most space optimal is "use whatever 16.8 divide you have", if you should happen to have one. 16.8 / 16 -> 16.8 would be slightly slower because of the 16bit trial subtraction, but that's only an issue if optimizing for speed.

With a 16.8 / 16 -> 16.8 in a library ... I would consider testing the high byte of the divisor and using a 16.8 / 8 -> 16.8 internally when the high byte is zero.


Does this seem eggregious to anyone besides me?

Posted: Sat Nov 20, 2021 8:39 pm
by ZeroByte

Just rewatched Ben Eater's video on converting binary to decimal using essentially this algorithm, and I totally get it now.

To round at the end, if N+3 >= $80 then inc the value in N0-2.

I'm going to use this algo. Thanks!

So, a night wasted on producing code that I won't use, but valuable in that it produced more knowledge so worthwhile anyway.


Does this seem eggregious to anyone besides me?

Posted: Sat Nov 20, 2021 8:56 pm
by ZeroByte

I think for fun I'm going to count cycles and compare my algo's speed vs the generic divide by anything algorithm. It'll be I treating to see if I'm faster or slower than just looping.

Wait - Box16's debugger will count them for me. Just step over the initial JSR and look at the cycles indicator.


Does this seem eggregious to anyone besides me?

Posted: Sun Nov 21, 2021 3:04 am
by Ed Minchau

You're dividing a 16 bit number by 60 to determine the number of 1/60th of a second "ticks" of the clock?  Do you really have a maximum duration of one thousand seconds between notes?


Does this seem eggregious to anyone besides me?

Posted: Sun Nov 21, 2021 3:30 am
by ZeroByte

No that's backwards.

VGM's native sample tick rate is 44100Hz. Hence if someone wanted to convert to ZSM format 1:1, the ZSM should advance at 44khz. (That's what I did for my first successful VGM 2 X16 conversion for Kevin back in 2019 when he needed to test YM2151's ability to read from the system bus at 8Mhz.) That works out to 730-something ticks per frame at 60Hz.

So the sample rate header is 16bit. Another example is iD's IMF format. Wolf3d's tick rate is like 11 ticks per frame for some reason... That's 660Hz, which is also a 16bit number.

Obviously it's way too CPU expensive to actually drive an IRQ at 44.1Khz for almost anything other than a dedicated player, but since "sane" values also require more than 8 bits, I figure why not reserve 2 header bytes instead of one?

Going the other way, the slowest tick rate would be 1hz which is 1 tick every 60 frames.

My playback routine's storage format of 16.8 covers this range, as this would be on the order of $0000.02

But such a slow rate is obviously absurd for music.


Does this seem eggregious to anyone besides me?

Posted: Sun Nov 21, 2021 4:00 am
by ZeroByte

Bear in mind that details like this are exactly what I want to spare developers from having to learn about. The basic goal is to provide a turnkey solution for having music playback in projects. The routine I'm doing this math for is a simple way to avoid worrying about how frequently to call the step_music function. Playmusic should be called once per frame and it just steps the music however many times is required by the header data's specification. It keeps track of partial steps and calls step_music one extra time whenever the residual rolls over.

In a nutshell, it "resamples" to 60Hz. It may not use the BEST method, but you're free to generate whatever Hz is specified in the header and call step_music directly if that's what you require.


Does this seem eggregious to anyone besides me?

Posted: Sun Nov 21, 2021 5:14 am
by Ed Minchau

Ok I think I see.

It occurs to me that if you don't care if it's exactly exact, division by 60 is very close to multiplication by 17 followed by division by 1024; the result is about 0.4% low.


Does this seem eggregious to anyone besides me?

Posted: Sun Nov 21, 2021 5:35 am
by ZeroByte

I just did a comparison between my algorithm (bespoke), @BruceMcF's General algorithm, and one that @Scott Robison proposed on Discord (Galaxybrain):

Galaxybrain took my algorithm and enhanced it by using clever byte-order manipulation to cut down the number of shifts required to perform the computation.

For computing 31,217/60 into 16.8 fixed point format: (answer = 48 08 02)

galaxybrain: 46 08 02 , 230 cycles, 159 bytes

bespoke: 46 08 02 , 531 cycles, 122 bytes

general: 48 08 02 , 3813 cycles, 70 bytes


As you can see, bespoke is pretty much what I said - a bit of a balance between speed and size (I could've gotten about 30-60 less cycles, but would've been at least as large as Galaxybrain.

My computation X = X>>6 + X>>10 + X>>14 + X>>18 + X>>22 is slightly off from the answer I computed by hand. Not sure why, as it's the same algorithm, and it didn't have this slight error in my main library. (I assembled all these into a testbed project). Galaxybrain gets the same error though, so not sure what's going on... Anyway, that's the results. As BruceMcF states - the general purpose answer is clearly the winner in size optimization (no shock there). As this calculation is only done once when a song is loaded, that may not be the biggest sacrifice in performance to make. It could probably be optimized a little for my application by skipping the first 5 loops and pre-shifting everything.

For reference, right after starting a song, it's usually going to load quite a few YM bytes (assuming it's got FM tracks) which is slow, about 150 cycles per write (due to having to wait on the YM to be ready for each successive write). So the first two algos pretty much equate to just a few YM writes' worth of time.

Well, that's the tale of the tape, folks. This has been a fun exercise.