Update upcoming VERA firmware (aka "VERA FX")

Jeffrey · Post by **Jeffrey** » Sun Aug 27, 2023 2:45 pm

Here is a bit of an update on the upcoming release of the VERA firmware v0.3.1, also called "VERA FX":

* All work on the firmware itself has been finished.
* All Verilog code has been reviewed.
* Documentation/a tutorial has been created, link here (there is some work to be done on the more advanced topics, but people can start playing around with it)
* The x16-docs and x16-emulator for r44 now contain the new FX features.
* A prerelease (with instructions how to upgrade your own X16 VERA firmware) has been created: link here
* Testing on real hardware is ongoing. Additional tests are being prepared so that more can be run.
* A demo (or a few) is being worked on to be shown at the upcoming VCF, to make (official) people aware of it.

Regarding the last point: voidstar (on discord) has upgraded his own X16 Dev machine and has made a really nice video of this demo:

I have been working on this FX update for many months now. I hope it will be a success. What is needed now is that it becomes known and that we can reach the point of an actual VERA firmware release and that the new (batch of) X16 machines will contain this new release.

Regards,

JeffreyH
(best known for the STNICCC and Wolf3D demos on this forum)

Guybrush · Post by **Guybrush** » Sun Aug 27, 2023 4:16 pm

Since no one from the team replied to my question on another topic, I'll repeat it here:

I have a (stupid?) question for the people who worked on VERA FX, so here it goes:

Why is there no option to read the entire 32-bit cache in one read operation, since there is an option to write it in one operation?

It would allow for near-DMA speeds when copying data within the video RAM. LDA/STA DATA0/1 is 4 cycles, which would make it possible to copy 4 bytes in just 8 cycles not accounting for loops, but let's add 3 more cycles for that, which makes it 11 cycles, 2.75 cycles per byte. That's pretty damn fast, and still totally under CPU control unlike traditional DMA.

32-bit cache write could stay just as it is right now, with nibble mask and everything, only a read mode would need to be added where all 4 bytes of the 32-bit cache would be loaded (they're already read from memory anyway). As for what would actually be returned to the CPU by the read operation, it could be the first byte or whatever.
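The cycle estimate in this post can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 4-cycle load/store cost and the 3-cycle loop allowance come from the post, while the function name and the comparison case (four separate byte reads to fill the cache before one 32-bit write) are assumptions made for illustration.

```python
# Cycle counts quoted in the post; names are illustrative only.
LDA_CYCLES = 4       # lda DATA0/DATA1
STA_CYCLES = 4       # sta DATA0/DATA1
LOOP_OVERHEAD = 3    # rough allowance for loop control, as estimated in the post
BYTES_PER_WRITE = 4  # one 32-bit cache write moves 4 bytes


def cycles_per_byte(loads, stores, overhead=LOOP_OVERHEAD, moved=BYTES_PER_WRITE):
    """Average CPU cycles per byte copied for one loop iteration."""
    total = loads * LDA_CYCLES + stores * STA_CYCLES + overhead
    return total / moved


# Proposed mode: one read fills the whole cache, one write flushes it.
proposed = cycles_per_byte(loads=1, stores=1)

# Assumed comparison: four byte-wide reads fill the cache, one write flushes it.
byte_fill = cycles_per_byte(loads=4, stores=1)

print(f"proposed 32-bit read: {proposed} cycles/byte")
print(f"byte-wise cache fill: {byte_fill} cycles/byte")
```

Under these assumptions the proposed read mode comes out at the 2.75 cycles per byte mentioned above, against 5.75 when the cache has to be filled one byte at a time.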

voidstar · Post by **voidstar** » Sun Aug 27, 2023 4:27 pm

Was very easy to update the ROMs!

Couple notes about that:

1) on the "release" link mentioned, you have to expand the "Assets" tab to find the actual download links. Maybe on some browsers it's expanded automatically, but for me I overlooked that at first.

2) On the SD card included with the DevBoards, the SYSTEM folder contains FLASHVERA.PRG, and a ROMFLASH sub-folder under it contains FLASH-CX16.PRG.

As another tip: give any file you put onto the SD card (new PRGs and ROM.BIN) an uppercase name, since that's just easier for the X16 DOS.

If you are unsure where the JP1 jumper is on the VERA, it has a small label next to it on the board. The DevBoard should come with a blue jumper piece on it (shown below). It'll be "hanging" on one pin ("open"), so with movement of the system during transport it's possible that it could come off and be misplaced. Being "jumped" means the jumper has to go across both pins there ("closed"). If the blue jumper is missing, it is hard to buy just one, but it would be something like this: <https://www.amazon.com/Uxcell-a13112100 ... ref=sr_1_5>
Any kind of wire secured between those pins will also do. The pins are only jumped temporarily to allow "write access" to the ROM; when you are done, you take the jumper off again.

Since the chips are socketed, if the ROM update gets botched it's not the end of the world - someone just has to mail you a new ROM chip, or send the board to someone who can swap them. That'd be inconvenient, but it's not like the board is stuck forever. But the on-system process goes quick, under a minute.

GREAT JOB on this 3D coding! I still have to read up on exactly how it is working, and looking forward to more 3D works on this system!!

Jeffrey · Post by **Jeffrey** » Sun Aug 27, 2023 5:03 pm

Guybrush wrote: ↑Sun Aug 27, 2023 4:16 pm Why is there no option to read the entire 32-bit cache in one read operation, since there is an option to write it in one operation?

Reading 4 bytes at the same time into the 32-bit cache was implemented at some point, but it was far too expensive in terms of LUTs.

There is a long list of features that I wanted to add (and sometimes tried to add), but they just don't fit on the chip. The iCE40 5k FPGA is really small. It was 85-90% full when the first FX work was started (initially just the line drawer), and there was/is a restriction not to exceed 95% (to leave room for bug fixes and small features). So there was hardly any room for FX at all (only 5-10%).

I have spent way too much time reducing the LUT count and trying to add features that reuse existing logic inside VERA. It is a miracle that so much could be added.

So if we had an 8k FPGA I would immediately add many features, but it's simply not possible due to the lack of resources in the chip.

Ed Minchau · Post by **Ed Minchau** » Mon Aug 28, 2023 1:49 am

First, you're a wizard.

Second, how long do these operations take? Specifically the 16 bit signed integer multiplication. If it's less than 200 cycles then you've saved me 8 banks of RAM and a kilobyte of low RAM. Do you just set the parameters and then read the result immediately?

mortarm · Post by **mortarm** » Mon Aug 28, 2023 2:11 am

voidstar wrote: ↑Sun Aug 27, 2023 4:27 pm ...so with transport movement of the system, it's possible that it could come off and be misplaced.

Since these are (I assume) placed by hand, I suggest using a bit of low-tack tape to hold the jumper in place during shipping, and including a note to that effect.

Jeffrey · Post by **Jeffrey** » Mon Aug 28, 2023 8:23 am

Ed Minchau wrote: ↑Mon Aug 28, 2023 1:49 am First, you're a wizard.

Second, how long do these operations take? Specifically the 16 bit signed integer multiplication. If it's less than 200 cycles then you've saved me 8 banks of RAM and a kilobyte of low RAM. Do you just set the parameters and then read the result immediately?

The FX Update exposes a signed 16-bit x 16-bit multiplier. This means you have to store two 16-bit values (so four sta instructions, each 4 CPU cycles). Then you do a write to DATA0/1, which writes the 32-bit result to VRAM (4 CPU cycles). So that's 20 CPU cycles to do the multiplication, without reading it back.

The result will then be stored in VRAM. You can do many of these calculations in sequence (which is more efficient) and have a string of results in VRAM that way, or do just one. You can then read the result back out of VRAM. Alternatively, you can use the result as input for the next multiplication (if you are doing 3D math, this is very common: a series of multiplications).

In other words (not counting setup time), it's around 20-30 CPU cycles per 16-bit signed multiplication, which is a major speed increase.
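As a rough check of the numbers above, here is a small Python sketch. The cycle figures are the ones given in this post; the function names are made up for illustration, and the little-endian byte order of the 32-bit result is an assumption, not something stated here.

```python
STA_CYCLES = 4  # cost of one sta, as quoted in the post


def multiply_cycles(operand_stores=4, result_writes=1):
    """CPU cycles to load two 16-bit operands and trigger one result write."""
    return (operand_stores + result_writes) * STA_CYCLES


def signed16_multiply(a, b):
    """Model of a signed 16x16 multiply producing a 32-bit two's complement result.

    Returns the 4 result bytes. Little-endian order is an assumption
    made for this sketch.
    """
    for operand in (a, b):
        if not -32768 <= operand <= 32767:
            raise ValueError("operands must fit in a signed 16-bit value")
    product = a * b  # always fits in 32 bits for 16-bit operands
    return (product & 0xFFFFFFFF).to_bytes(4, "little")


result = signed16_multiply(-3, 1000)
print(multiply_cycles(), result.hex(), int.from_bytes(result, "little", signed=True))
```

Reading the result back, or chaining it into a further multiplication, adds to the 20 cycles, which is where the 20-30 cycle range comes from.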

You can also combine this with accumulation. You can read up on that in the documentation/tutorial.

It is designed to be fast for 3D math (aka "matrix math"), but can be used for anything you want of course.

But that's just one of the features of the VERA FX Update. Here is a list:

* A 32-bit cache that gets filled when reading from VRAM (using an lda DATA0 or lda DATA1) one byte (or nibble) at a time, but can be written to VRAM 4 bytes at a time (half of a “micro blitter”, if you will) when doing an sta DATA0 or sta DATA1. During this write it is also possible to mask parts of the 32 bits (at nibble level). Furthermore, which part of the cache is filled can be tightly controlled.
* The ability to write transparent pixels: when writing to VRAM (using DATA0 or DATA1) the value 0 can be treated differently: it can be ignored, so that no VRAM bytes are changed. This also works in combination with the 32-bit cache.
* A 4-bit addressing mode that allows you to manipulate the address at nibble level. A nibble incrementer (or decrementer) has also been added.
* An affine transformation helper that allows you to sample (tiled) pixels in such a way that rotation, scaling and shearing are possible. With extra work on the CPU side, a perspective transformation can also be achieved. This mode uses a tilemap of (8x8 pixel) tiles, so tiles can be reused. A map can either be clipped (when combined with transparent writes) or repeated.
* A line-draw helper which helps you draw lines on the screen by incrementing the address according to a (settable) slope, effectively implementing Bresenham's line algorithm.
* A polygon filler helper which helps you fill polygons/triangles really fast. It allows you to set two slopes: one for each side of the polygon part that has to be filled (see for example here). The helper tells you the length of the current line to fill and, after you have filled it yourself, lets you trigger it to increment the x-position of both sides of the polygon part; it also sets the address to the starting pixel. The fill line length is returned in a packed form that makes the use of jump tables (a 65C02 feature) very efficient.
* A 16-bit x 16-bit multiplier and 32-bit accumulator which allow for signed 16-bit multiplication and 32-bit accumulation. This works from the 32-bit cache to VRAM. Special cache-index and address incrementers have been added to facilitate bulk math operations.
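To give a feel for the first two items in the list (the 32-bit cache write with a nibble mask, and transparent writes), here is a minimal Python model. It is not the VERA implementation: the function name, the 4-byte alignment, the layout of the mask bits, the choice that a set mask bit means "keep the old nibble", and byte-level transparency are all assumptions made for this sketch. The tutorial and the x16-docs describe the real register behaviour.

```python
def write_cache(vram, addr, cache, nibble_mask=0x00, transparent=False):
    """Write a 4-byte cache into vram in one operation (hypothetical model).

    nibble_mask: 8 bits, two per cache byte (low nibble first). In this
                 sketch a set bit means "leave that nibble in VRAM unchanged".
    transparent: when True, cache bytes equal to 0 are skipped entirely.
    """
    base = addr & ~3  # assumption: the write covers a 4-byte aligned block
    for i in range(4):
        new = cache[i]
        if transparent and new == 0:
            continue  # value 0 is treated as "no pixel": VRAM stays untouched
        old = vram[base + i]
        keep_low = (nibble_mask >> (2 * i)) & 1
        keep_high = (nibble_mask >> (2 * i + 1)) & 1
        low = (old if keep_low else new) & 0x0F
        high = (old if keep_high else new) & 0xF0
        vram[base + i] = high | low


vram = bytearray([0x11, 0x22, 0x33, 0x44])
# Byte 1 is transparent (0x00); both nibbles of byte 3 are masked off.
write_cache(vram, 0, bytes([0xAA, 0x00, 0xCC, 0xDD]),
            nibble_mask=0b11000000, transparent=True)
print(vram.hex())
```

In this model a single write changes up to four bytes while leaving masked or transparent positions alone, which is the "half of a micro blitter" idea described above.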

Note you can already play with it using the latest emulator!

And the tutorial helps you create your own "hello world!" programs.

Ed Minchau · Post by **Ed Minchau** » Mon Aug 28, 2023 3:45 pm

Very good. This is really going to help with my dot product and interpolation. I'm thinking I can just put all my variable values into VRAM and then do the sequence of multiplications and additions before reading the final results out of VRAM. Really looking forward to the next Oxygene demo.
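The "sequence of multiplications and additions" for a dot product can be modelled in Python as below. This only illustrates the arithmetic (signed 16-bit operands, 32-bit accumulation); the helper names are hypothetical, and wrapping on 32-bit overflow is an assumption of the sketch rather than documented VERA behaviour.

```python
def wrap_signed32(value):
    """Wrap an integer into the signed 32-bit range (assumed overflow behaviour)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def dot_product(a, b):
    """Multiply-accumulate over two equal-length lists of signed 16-bit values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0
    for x, y in zip(a, b):
        if not (-32768 <= x <= 32767 and -32768 <= y <= 32767):
            raise ValueError("operands must fit in a signed 16-bit value")
        acc = wrap_signed32(acc + x * y)  # one multiply, one accumulate
    return acc


print(dot_product([1, 2, 3], [4, 5, 6]))
```

With three components per vector this is three multiply-accumulate steps followed by a single read of the 32-bit result.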

Kalvan · Post by **Kalvan** » Tue Aug 29, 2023 12:51 am

All of a sudden, the Commander X-16 goes from being the rough equal of the Sharp X-1 Turbo or the NEC PC8812/4 to better than the PC Engine/TurboGrafx-16 in terms of prowess in converting Sega, Taito, Konami, or Atari racing or behind-the-back shooter arcade games.

Looking forward to versions of Space Harrier, Afterburner, Road Blasters, and Aqua Jack only slightly worse than their Sharp X68000 and Fujitsu FM Towns versions. Yes, the affine elements are limited to bitmaps and tilemaps, but even if you limit them to pieces of otherwise stationary scenery, that means player character sprites only need four frames each, and NPCs/enemies, bullets, and missiles only need, say, 16 frames each. Quite doable if you budget your resources well. Plus, you can use polygons for big bosses and blocky buildings in urban areas.

Not only that, but the FX block is the final piece of the puzzle necessary to implement the 16-bit Sonic games on the X-16, even Sonic CD and 3D Blast, well, except for the soundtrack of the former. Yes, you can even do the special stages of Sonic 3 & Knuckles and 3D Blast, if it's on a full 2 MB System RAM, you're willing to stick it on a 4 MB cartridge with a compression chip, and decompress some of those assets and engines to system RAM.

Oh, if only we had gotten this machine (albeit with full backward compatibility with the VIC-20 and 64 aside from the illegal opcodes) from Commodore instead of the 128 back in the day. With GEOS it could have been an original Macintosh killer and near peer of the Sharp X68000 for MSX2 levels of money, and with arbitrary line drawing, Chinese/Japanese/Korean text modes become doable without the need for expensive Kanji, Kana, and Hangul ROMs.

And by the time the X68000, Canon CAT, Sony News, and FM Towns debut, the followups featuring 16-bit and 32-bit chips should be about ready for release...
