New demo uploaded: Wolfenstein 3D - raycasting demo with textures

Ed Minchau · Post by **Ed Minchau** » Tue Mar 16, 2021 3:45 am

4 hours ago, desertfish said:

If you make the textures mirrored across the Y axis you can cut the required texture samples in half again, and just mirror the bottom half from the upper half of the screen.

Don't know if you wanna do this because it may start looking really ugly if the textures are constrained this way perhaps....

Also for non-wall textures (monster sprites?) this surely is a nasty restriction

He's using the Wolfenstein 3D textures; unfortunately they don't mirror either vertically or horizontally.

Jeffrey · Post by **Jeffrey** » Tue Mar 16, 2021 5:25 am

What I am noticing is that the raycasting itself (the DDA algorithm) is right now around twice as expensive as the blitting to the screen. So halving vertically won't give you much speed improvement. Halving horizontally would certainly help.

What is more interesting right now is how to do the dda-algorithm quickly in assembly. Right now, it does for each ray:

a tan() and inverse-tan() lookup resulting in two 16-bit values (x_step and y_step)

two multiplications for determining the initial intersection points inside the cell you are standing in (x_intercept and y_intercept)

quite a lot of branches to implement the logic used by the dda-algorithm (including copying 16-bit numbers)

several decrementers, incrementers, subtractions and additions of 16 bit numbers

bit-shifters to do a lookup in the world-map table(s)

two multiplications of a 16-bit and an 8-bit value (using x, y distance and cos/sin to get to the distance from the camera plane)

a divide of a 16-bit value (the distance to the wall) by a 16-bit constant resulting in a wall height (16-bit) --> expensive! (want to use lookup tables)

a capping of the wall height (16-bit) into a render height (byte)

lots of small details
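To make the steps in the list above concrete, here is a minimal sketch of a grid DDA in Python. It is the common floating-point textbook variant, not the 16-bit fixed-point x_step / y_step version described in this post, and the `WORLD` map and the name `cast_ray` are made up for illustration.

```python
import math

# Hypothetical 5x5 map for illustration: 1 = wall, 0 = empty.
WORLD = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

def cast_ray(px, py, angle):
    """Step cell by cell along the ray until a wall cell is hit.

    Returns (map_x, map_y, distance, side). side is 0 if the last grid
    line crossed was a vertical one (x step) and 1 for a horizontal one.
    """
    dx, dy = math.cos(angle), math.sin(angle)
    map_x, map_y = int(px), int(py)
    # Ray length needed to cross one full cell along each axis.
    delta_x = abs(1.0 / dx) if dx != 0 else math.inf
    delta_y = abs(1.0 / dy) if dy != 0 else math.inf
    # Step direction and ray length to the first grid line on each axis.
    if dx < 0:
        step_x, side_x = -1, (px - map_x) * delta_x
    else:
        step_x, side_x = 1, (map_x + 1.0 - px) * delta_x
    if dy < 0:
        step_y, side_y = -1, (py - map_y) * delta_y
    else:
        step_y, side_y = 1, (map_y + 1.0 - py) * delta_y
    while True:
        # Always advance to whichever grid line is nearer along the ray.
        if side_x < side_y:
            dist, side = side_x, 0
            side_x += delta_x
            map_x += step_x
        else:
            dist, side = side_y, 1
            side_y += delta_y
            map_y += step_y
        if WORLD[map_y][map_x]:
            return map_x, map_y, dist, side
```

Standing in the middle of the map and looking along +x, `cast_ray(2.5, 2.5, 0.0)` walks two cells and stops at the wall in column 4. The loop assumes the map is fully enclosed by walls, so it has no bounds check.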

For setting up the rays I also change the input so that I only have to do the logic for one quadrant.

My gut feeling is that the above should take (maybe) several hundred cycles per ray. Maybe 300-400? So 304 rays * 300-400 cycles = roughly 90,000 to 120,000 cycles. So maybe 1 tick. Yet it is spending about 7-8 ticks now. So there is much room for improvement, I think.
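The estimate can be checked with a few lines of Python. This is only a sketch: the 8 MHz clock and the 60 Hz tick are assumptions that are not stated in the post, and `frame_cost` is a made-up helper name.

```python
CPU_HZ = 8_000_000      # assumption: 8 MHz CPU clock
TICKS_PER_SECOND = 60   # assumption: one tick is one 60 Hz jiffy
RAYS = 304              # from the post

cycles_per_tick = CPU_HZ // TICKS_PER_SECOND

def frame_cost(cycles_per_ray):
    """Return (total cycles, ticks) for casting all rays once."""
    total = RAYS * cycles_per_ray
    return total, total / cycles_per_tick

for cycles_per_ray in (300, 400):
    total, ticks = frame_cost(cycles_per_ray)
    print(f"{cycles_per_ray} cycles/ray -> {total} cycles, {ticks:.2f} ticks")

# Working backwards: cycles per ray implied by the measured 7-8 ticks.
for ticks in (7, 8):
    implied = ticks * cycles_per_tick // RAYS
    print(f"{ticks} ticks -> about {implied} cycles/ray")
```

Under those assumptions the 300-400 cycle target fits inside one tick, while the measured 7-8 ticks imply a few thousand cycles per ray.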

Basically I implement the logic described in this video:

It would be cool if we could iterate together by suggesting / showing to each other what example assembly snippets would be faster in order to bring down the cycle count needed for this algorithm.

First I have to release though. So back to doing some (much needed) cleanup again.

izb · Post by **izb** » Tue Mar 16, 2021 7:05 am

You have per height routines, but could you also have a per texture+height routine? You could take advantage of textures created with vertical runs of identical pixels to reduce “reads” from the texture. I’d imagine the answer is that that would use a ton of memory, but I’m wondering how much memory the routines are using up now?

Jeffrey · Post by **Jeffrey** » Tue Mar 16, 2021 7:50 am

48 minutes ago, izb said:

You have per height routines, but could you also have a per texture+height routine? You could take advantage of textures created with vertical runs of identical pixels to reduce “reads” from the texture. I’d imagine the answer is that that would use a ton of memory, but I’m wondering how much memory the routines are using up now?

Thanks. That would be too much RAM usage. Right now the longest routine is 64 reads and 182 writes. So that's (64+182)*3 = 738 bytes. So I reserve 1 KB per routine now (for fast access to the routines). That covers all of banked RAM right now. I can probably pack it better, but 512 routines times the number of textures would be way too much memory.
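As a rough check of those numbers, the arithmetic can be written out in Python. The texture count passed to `per_texture_total` is a hypothetical figure, since the post does not say how many textures are in use.

```python
READS, WRITES = 64, 182      # longest routine, from the post
BYTES_PER_INSTRUCTION = 3    # absolute-addressed LDA / STA are 3 bytes each
ROUTINES = 512               # from the post
SLOT_BYTES = 1024            # 1 KB reserved per routine

longest_routine = (READS + WRITES) * BYTES_PER_INSTRUCTION
per_height_total = ROUTINES * SLOT_BYTES

def per_texture_total(texture_count):
    """Bytes needed if every texture got its own set of routines."""
    return per_height_total * texture_count

print(longest_routine, "bytes in the longest routine")
print(per_height_total // 1024, "KB for one set of per-height routines")
# Hypothetical texture count, purely for illustration.
print(per_texture_total(8) // (1024 * 1024), "MB for 8 textures")
```

One set of routines already comes to 512 KB, so even a handful of textures multiplies that far past what is available.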

Another way to reduce (dummy) reads is to have (smaller) textures in normal memory (but "striped" pixel by pixel) and simply hard code the needed read addresses (with an x-index for the texture index) in each routine for all walls smaller than the texture height.

For example:

LDA $5603, X

STA VERA_DATA0

LDA $5719, X

STA VERA_DATA0

...

Where X contains the texture index.
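Routines like this would presumably be generated rather than typed by hand. Below is a small Python sketch of such a generator. The texture height of 64, the base address, and the one-page-per-texel-row layout are all assumptions for illustration and do not match the addresses in the example above.

```python
TEXTURE_HEIGHT = 64      # assumption: 64 texel rows per texture
TEXTURE_BASE = 0x5000    # hypothetical address of texel row 0
ROW_STRIDE = 0x0100      # hypothetical: one 256-byte page per texel row

def routine_for_height(wall_height):
    """Return assembly lines that draw one column of wall_height pixels."""
    if not 0 < wall_height <= TEXTURE_HEIGHT:
        raise ValueError("only for walls no taller than the texture")
    lines = []
    for screen_row in range(wall_height):
        # Nearest-neighbour sample: which texel row lands on this pixel.
        texel_row = screen_row * TEXTURE_HEIGHT // wall_height
        address = TEXTURE_BASE + texel_row * ROW_STRIDE
        lines.append(f"LDA ${address:04X}, X")
        lines.append("STA VERA_DATA0")
    lines.append("RTS")
    return lines

print("\n".join(routine_for_height(4)))
```

Each screen pixel costs exactly one read and one write, so texel rows that are skipped at small wall heights are never read at all.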

ZeroByte · Post by **ZeroByte** » Tue Mar 16, 2021 5:57 pm

Well, if you're going to squash vertical pixels, you may as well squash the horizontal resolution the same amount, scale it back up with VERA, and use the extra space in the single screen bitmap as double buffering space.

I think even a 286 couldn't play Wolf3d at full resolution / FPS.

Jeffrey · Post by **Jeffrey** » Wed Mar 17, 2021 11:16 am

New version uploaded.

Cunnah · Post by **Cunnah** » Wed Mar 17, 2021 1:07 pm

Looks great!

I'm curious if you are doing any culling when you render out the world?

desertfish · Post by **desertfish** » Wed Mar 17, 2021 1:57 pm

I don't think a raycaster needs special culling because you only render the exact number of vertical pixel columns with the first wall (texture) that is hit with the view ray. So any walls 'behind' others are never seen by the algorithm.

ZeroByte · Post by **ZeroByte** » Wed Mar 17, 2021 2:23 pm

It's a shame that OPL and OPM are totally different animals, otherwise the AdLib sound is also right there for the playing, too. I DL'd the Wolf3d source to look into that possibility, but haven't come up with any ideas for how to convert between the two on the fly.

Ed Minchau · Post by **Ed Minchau** » Wed Mar 17, 2021 7:33 pm

On 3/16/2021 at 11:57 AM, ZeroByte said:

Well, if you're going to squash vertical pixels, you may as well squash the horizontal resolution the same amount, scale it back up with VERA, and use the extra space in the single screen bitmap as double buffering space.

I think even a 286 couldn't play Wolf3d at full resolution / FPS.

I was thinking about that too. If the VERA HSCALE is $33, that would show 256 columns of pixels, and a VSCALE of $22 would show 128 rows; so that's 32 KB for a screen at 8bpp. A tile map could just be 32x32 of 8x8 tiles (so 1024 tiles) and give two full screens and can wait for VSYNC to flip screens. I'd use $08000 to $17FFF for the tile data and move the layer 1 tile data down to $03800 and the layer 0 tile map at $03000.
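A quick Python check of those figures, as a sketch only. It assumes the usual VERA convention of a 640x480 output where a scale value of $80 means one source pixel per output pixel; `visible_pixels` is a made-up helper name.

```python
def visible_pixels(output_pixels, scale):
    """Source pixels shown across the output for a given scale register."""
    return output_pixels * scale / 128

HSCALE, VSCALE = 0x33, 0x22

columns = visible_pixels(640, HSCALE)
rows = visible_pixels(480, VSCALE)

screen_bytes = 256 * 128     # one 256x128 screen at 8bpp
tile_bytes = 32 * 32 * 64    # 1024 tiles of 8x8 pixels at 8bpp

print(columns, "columns,", rows, "rows")
print(screen_bytes // 1024, "KB per screen,", tile_bytes // 1024, "KB of tile data")
```

The scale values land just under 256 and 128, so the last column and the last half row would fall off the edge of the display. The 64 KB of tile data fills $08000 to $17FFF exactly.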

The tiles would be arranged in columns so the first column would be tiles 00 01 02 03 etc; the increment for a column would be 8 instead of 320. There would just need to be a couple of lookup tables for the low byte/high byte of the first pixel in a column to initialize the VRAM address pointer; maybe two such sets of lookup tables for the two screens.
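A sketch of how those lookup tables could be built, in Python. It assumes 16 tiles per screen column (128 rows / 8) and that the two screens start at $08000 and $10000; the function names are made up for illustration.

```python
TILE_BYTES = 64                      # 8x8 pixels at 8bpp
TILES_PER_COLUMN = 16                # 128 rows / 8 rows per tile
SCREEN_BASES = (0x08000, 0x10000)    # assumption: one screen per 32 KB half

def column_start(screen, x):
    """VRAM address of the top pixel of screen column x (0-255)."""
    tile_column, pixel_in_tile = divmod(x, 8)
    first_tile = tile_column * TILES_PER_COLUMN
    return SCREEN_BASES[screen] + first_tile * TILE_BYTES + pixel_in_tile

def pixel_address(screen, x, y):
    """Moving down one row always adds 8, even across tile boundaries."""
    return column_start(screen, x) + 8 * y

def build_tables(screen):
    """Low-byte and high-byte tables for the 256 column start addresses."""
    starts = [column_start(screen, x) for x in range(256)]
    low = [a & 0xFF for a in starts]
    high = [(a >> 8) & 0xFF for a in starts]
    return low, high

low0, high0 = build_tables(0)
low1, high1 = build_tables(1)
```

With these bases, bit 16 of the VRAM address is constant per screen (0 for the first, 1 for the second), so it would not need a table of its own.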

The old Wolfenstein raycasting method could be improved by borrowing the binary space partition idea from Doom. Anything that can be done to reduce the computational overhead for finding which column of pixels to show at what height value will speed this up a lot.