What I am noticing is that the raytracing itself (the DDA algorithm) is currently about twice as expensive as blitting to the screen. So halving the resolution vertically won't give you much of a speed improvement. Halving it horizontally would certainly help, since that halves the number of rays.
What is more interesting right now is how to do the DDA algorithm quickly in assembly. Right now, it does the following for each ray:
a tan() and inverse-tan() lookup resulting in two 16-bit values (x_step and y_step)
two multiplications for determining the initial intersection points inside the cell you are standing in (x_intercept and y_intercept)
quite a lot of branches to implement the logic used by the DDA algorithm (including copying 16-bit numbers)
several decrements, increments, subtractions and additions of 16-bit numbers
bit shifts to do a lookup in the world-map table(s)
two multiplications of a 16-bit and an 8-bit value (using the x, y distance and cos/sin to get the distance from the camera plane)
a divide of a 16-bit value (the distance to the wall) by a 16-bit constant resulting in a wall height (16-bit) --> expensive! (want to use lookup tables)
a capping of the wall height (16-bit) into a render height (byte)
lots of small details
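To make the lookup-table idea for the divide concrete, here is a rough sketch in C (easier to read than assembly, and it maps fairly directly to a table read). Everything in it is an assumption on my side: that the wall height is inversely proportional to the distance, the constant HEIGHT_CONSTANT, the cap MAX_RENDER_HEIGHT, and the choice to index the table with the top 9 bits of the 16-bit distance. It also folds the capping step into the table, so the result is already a byte.

```c
#include <stdint.h>

/* All values below are hypothetical and only serve the illustration. */
#define HEIGHT_CONSTANT   16384u              /* assumed: height = HEIGHT_CONSTANT / distance */
#define MAX_RENDER_HEIGHT 200u                /* assumed cap so the result fits in a byte */
#define TABLE_BITS        9u                  /* index with the top 9 bits of the distance */
#define TABLE_SIZE        (1u << TABLE_BITS)  /* 512 one-byte entries */
#define DIST_SHIFT        (16u - TABLE_BITS)  /* low bits of the distance that are dropped */

static uint8_t render_height_table[TABLE_SIZE];

/* Run once at startup: does the expensive divide for every distance bucket. */
void build_render_height_table(void) {
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        /* Use the midpoint of the bucket as its representative distance
           (this also avoids dividing by zero for bucket 0). */
        uint32_t distance = (i << DIST_SHIFT) + (1u << (DIST_SHIFT - 1u));
        uint32_t height = HEIGHT_CONSTANT / distance;
        if (height > MAX_RENDER_HEIGHT) {
            height = MAX_RENDER_HEIGHT;       /* capping is baked into the table */
        }
        render_height_table[i] = (uint8_t)height;
    }
}

/* Per ray: a shift and one table read instead of a 16-bit divide plus a cap. */
uint8_t render_height(uint16_t distance) {
    return render_height_table[distance >> DIST_SHIFT];
}
```

The trade-off is precision: nearby walls share coarse buckets, but those are mostly capped anyway. A bigger table (or a second, finer table for short distances) would be the obvious refinement.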
For setting up the rays I also change the input so that I only have to do the logic for one quadrant.
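For anyone wondering what the one-quadrant trick can look like, here is a small sketch in C. It is not necessarily how my code does it: the byte-sized angle (256 units per full turn), the FoldedRay struct and the function name are all made up for the example. The idea is to mirror the ray angle into the first quadrant and remember the signs, so the tan/sin/cos tables and the stepping logic only need to handle 0 to 90 degrees.

```c
#include <stdint.h>

/* Hypothetical angle format: 256 units per full turn, so 64 units = 90 degrees. */
#define QUARTER_TURN 64u

typedef struct {
    uint8_t angle;   /* folded angle, 0..64 inclusive (first quadrant) */
    int8_t  sign_x;  /* +1 or -1: direction to step in x */
    int8_t  sign_y;  /* +1 or -1: direction to step in y */
} FoldedRay;

/* Mirror any ray angle into the first quadrant and record the step signs. */
FoldedRay fold_to_first_quadrant(uint8_t angle) {
    uint8_t quadrant = angle >> 6;    /* top two bits select the quadrant */
    uint8_t offset   = angle & 0x3F;  /* position inside that quadrant */
    FoldedRay r;

    /* Quadrants 1 and 3 are mirror images, so the angle runs backwards there. */
    r.angle  = (quadrant & 1u) ? (uint8_t)(QUARTER_TURN - offset) : offset;
    r.sign_x = (quadrant == 1u || quadrant == 2u) ? -1 : 1;  /* cos is negative */
    r.sign_y = (quadrant >= 2u) ? -1 : 1;                    /* sin is negative */
    return r;
}
```

Note that the folded angle can be exactly 64, so the lookup tables need 65 entries. This assumes y grows in the direction of positive sin; flip sign_y if the map is stored the other way around.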
My gut feeling is that the above should take (maybe) several hundred cycles per ray. Maybe 300-400? So 304 rays * 300-400 cycles = roughly 90,000 to 120,000 cycles. So maybe 1 tick. Yet it is spending about 7-8 ticks now. So there is a lot of room for improvement, I think.
Basically I implement the logic described in this video:
It would be cool if we could iterate together by showing each other example assembly snippets that would be faster, in order to bring down the cycle count needed for this algorithm.
First I have to release though. So back to doing some (much needed) cleanup again.