32X rendering engines

ob1
Moldy Popcorn

Posts: 29

32X rendering engines Mar 6, 2007 3:32:02 GMT -5

Post by ob1 on Mar 6, 2007 3:32:02 GMT -5

I've created this topic since "32x cartridge bigger than 32Mbits" isn't really the right place for it. The aim of this topic is to discuss using the 32X as a 3D rendering engine for the 68000.
We could think of it as a tile rendering engine, or as anything else, but that's not the case here.
I've already discussed this subject on gendev.spritesmind.net. You'll also find a 32X debug version of Gens, enhanced by Kaneda: GensKMod 0.7.

Here we go.

gigabite said:

No, I mean doing it 3dfx style and having each SH2 render different frames.

<68000 does game data>

SH2's do graphics in 32x

SH2 #1 renders frame 1 and sends to 32x VDP
SH2 #2 renders frame 2 and sends to 32x VDP
SH2 #1 renders frame 3 and sends to 32x VDP
SH2 #2 renders frame 4 and sends to 32x VDP

etc.

OK, I get it. But I think there could be another way:
For each frame,
SH2 #1 renders a polygon and sends to 32X VDP FILL
SH2 #2 renders a polygon and sends to 32X VDP FILL
SH2 #1 renders a polygon and sends to 32X VDP FILL
SH2 #2 renders a polygon and sends to 32X VDP FILL
...
Ok, it would need a little bit of synchronization for accessing the VDP FILL.
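A minimal sketch of that bit of synchronization, in portable C11 rather than SH2 assembly. The names (FillLock, polygon_owner, fill_try_lock, fill_unlock) are made up for illustration, and the lock is just a standard atomic flag; how to get an equivalent atomic test-and-set between the two SH2s on real hardware is not covered here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock guarding the single VDP FILL unit. */
typedef struct {
    atomic_flag busy;
} FillLock;

/* Polygons are dealt out alternately: even indices go to SH2 #1
   (returns 0), odd indices go to SH2 #2 (returns 1). */
static int polygon_owner(unsigned polygon_index)
{
    return (int)(polygon_index & 1u);
}

/* Try to take the fill unit. Returns true if the caller now owns it,
   false if the other CPU is still filling. */
static bool fill_try_lock(FillLock *lock)
{
    assert(lock != NULL);
    return !atomic_flag_test_and_set_explicit(&lock->busy,
                                              memory_order_acquire);
}

/* Give the fill unit back once the polygon has been sent. */
static void fill_unlock(FillLock *lock)
{
    assert(lock != NULL);
    atomic_flag_clear_explicit(&lock->busy, memory_order_release);
}
```

Each CPU would do its own transformation work for the polygons it owns, then spin on fill_try_lock() only for the short time it actually feeds the fill unit.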

pmjobin said:

you can't run the same code on both of them and expect good performance

I rather disagree. The cache can handle both data and instructions. So, once the program has been read once, it resides in cache until evicted by the LRU algorithm. But I don't think that would occur, however intensive the rendering is: instructions are fetched a lot more, in my opinion (but don't forget I'm still a real newbie).

pmjobin said:

What I suggest is to only use the slave SH to render into the framebuffer

The Hardware Manual states the same : "Within 2 SH2 units, it is normal for the master to control the entire 32X and the slave to restore the computing element inside SH2 and works expecially in numerical computing"
It is a kind of cooperative multitasking. What I'm afraid of is that this way, the master wouldn't handle many things, while the slave CPU would be under heavy load. "Preemptive" multitasking would balance the load a bit more, even if sync is needed.

pmjobin said:

the slave SH can only lock access to the bus when the master frees it

OK, but does the Master lock the bus that often? If the Master and Slave access the bus at the same time, the Slave has to wait. But quite soon the Master will release it, and the Slave can write to it. How long is the wait? 1 cycle, 2? Is wasting up to 2 cycles really slower than wasting none but using half the CPU power? I really don't know, I just want some clues.

pmjobin said:

the cache is shared so data access can only occur when the MOV instruction is executed in the leading IF stage

I think I lack some skill, since I didn't get anything from this. The cache is shared, so instructions and data are mixed in the cache. When could data access occur, apart from MOV? And even if data access occurs only in the leading IF stage, there is 1 IF stage per cycle, so is it such a concern?

As already mentioned, I don't have a lot of experience in 32X development. So what I state here is theory only, and may be missing some knowledge.

pmjobin
Jedi Master Material

Boink!

Posts: 54

32X rendering engines Mar 6, 2007 12:30:11 GMT -5

Post by pmjobin on Mar 6, 2007 12:30:11 GMT -5

We could think of it as a tile rendering engine, or as anything else, but that's not the case here.

I wasn't referring to that kind of tile engine, but rather tile based deferred rendering. The very same technique that is used by the CLX2 chip inside the Dreamcast. You can get a good overview of what this is all about here:
www.beyond3d.com/articles/tilebasedrendering/index1.php

OK, I get it. But I think there could be another way:
For each frame,
SH2 #1 renders a polygon and sends to 32X VDP FILL
SH2 #2 renders a polygon and sends to 32X VDP FILL
SH2 #1 renders a polygon and sends to 32X VDP FILL
SH2 #2 renders a polygon and sends to 32X VDP FILL
...
Ok, it would need a little bit of synchronization for accessing the VDP FILL.

Yes, this could be done in order to prevent the bus fight that would occur using GiGaBiTe's approach. You could for example use the operand cache "RAM mode" feature of the SH in order to perform the polygon transformation and scan conversion in one SH entirely in cache while the other SH rasterizes a polygon. But the load on each SH would be uneven and you'd end up taking a huge performance hit.

you can't run the same code on both of them and expect good performance

I rather disagree. The cache can handle both data and instructions. So, once the program has been read once, it resides in cache until evicted by the LRU algorithm. But I don't think that would occur, however intensive the rendering is: instructions are fetched a lot more, in my opinion (but don't forget I'm still a real newbie).

Yeah, a very tight inner loop that is being executed will most likely stay in cache, but this isn't the case for texture data. Texture access is far from being a "linear" operation and you'll end up with lots and lots of nasty cache misses, unless you use very small textures (less than 32x32). For each cache miss, the SH must fetch the corresponding data from SDRAM, so the bus is kind of clogged by the SH that performs rasterization. When I said that you cannot execute the same code on both SHs and expect good performance, I was referring to the master > slave configuration. If the master performs R/W loops that go through the external bus, the slave CANNOT be granted access to the bus. So the two SHs really don't behave the same way.
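To put rough numbers on the texture point, here is a toy cache model in C. It assumes a 4 KB, 4-way set-associative cache with 16-byte lines and LRU replacement (my reading of the SH2 cache, so treat those figures as an assumption), 8-bit texels, a texture starting at address 0, and nothing else competing for the cache. All names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 16, NUM_SETS = 64, NUM_WAYS = 4 };  /* 4 KB total */

typedef struct {
    uint32_t tag[NUM_SETS][NUM_WAYS];
    uint32_t age[NUM_SETS][NUM_WAYS];    /* larger = used more recently */
    uint8_t  valid[NUM_SETS][NUM_WAYS];
    uint32_t clock;
    uint32_t misses;
} Cache;

/* Look up one byte address; on a miss, replace the LRU way of its set. */
static void cache_access(Cache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set = line % NUM_SETS;
    uint32_t tag = line / NUM_SETS;
    int victim = 0;

    c->clock++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->age[set][w] = c->clock;   /* hit: refresh LRU age */
            return;
        }
    }
    for (int w = 0; w < NUM_WAYS; w++) { /* miss: free way, else oldest */
        if (!c->valid[set][w]) { victim = w; break; }
        if (c->age[set][w] < c->age[set][victim]) victim = w;
    }
    c->valid[set][victim] = 1;
    c->tag[set][victim] = tag;
    c->age[set][victim] = c->clock;
    c->misses++;
}

/* Count misses when every texel of a size x size texture is read once,
   either row by row (linear) or column by column (non-linear). */
static uint32_t texture_walk_misses(int size, int column_major)
{
    Cache c;
    assert(size > 0);
    memset(&c, 0, sizeof c);
    for (int outer = 0; outer < size; outer++) {
        for (int inner = 0; inner < size; inner++) {
            int u = column_major ? outer : inner;
            int v = column_major ? inner : outer;
            cache_access(&c, (uint32_t)(v * size + u));
        }
    }
    return c.misses;
}
```

In this model a 32x32 texture costs 64 misses whichever way you walk it, while a 128x128 texture walked column by column misses on every single texel (16384 misses, against 1024 in row order).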

The Hardware Manual states the same : "Within 2 SH2 units, it is normal for the master to control the entire 32X and the slave to restore the computing element inside SH2 and works expecially in numerical computing".

I prefer to see this the other way around. Have the slave SH perform all the bus accesses while the master does the heavy math. This way, the master can take over the bus from the slave whenever it needs access to it.

Master: ALU ALU READ ALU ALU WRITE ...
Slave: READ READ WAIT WRITE READ WAIT ...

If you follow the recommendation of the H/W manual, then the slave SH will have to _wait_ until the master is done with the bus. If the master is doing lots of R/W ops through the bus, then one can easily imagine what's gonna happen.

Master: READ READ READ WRITE ALU WRITE ...
Slave: WAIT WAIT WAIT WAIT READ WAIT ...
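The two charts can be reproduced with a few lines of C. This is only the simplified model used in the charts (one operation per cycle, the slave always wants the bus, and it waits whenever the master touches it), not a timing-accurate description of the real arbitration. The type and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_ALU, OP_READ, OP_WRITE } Op;

static int uses_bus(Op op)
{
    return op == OP_READ || op == OP_WRITE;
}

/* Returns how many slave bus operations (READ or WRITE) complete while
   the master runs its sequence. In every cycle where the master uses
   the bus, the slave gets a WAIT instead. */
static size_t slave_bus_ops_completed(const Op *master, size_t cycles,
                                      size_t slave_wanted)
{
    size_t done = 0;
    assert(master != NULL);
    for (size_t c = 0; c < cycles && done < slave_wanted; c++) {
        if (!uses_bus(master[c]))
            done++;   /* bus is free: the slave access goes through */
    }
    return done;
}
```

Fed with the first master sequence (ALU ALU READ ALU ALU WRITE) the slave completes 4 accesses in 6 cycles; with the second one (READ READ READ WRITE ALU WRITE) it completes only 1.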

What I'm afraid of is that this way, the master wouldn't handle many things, while the slave CPU would be under heavy load.

Why would it be so? The master can prepare the tile array or span tree which requires some heavy math processing while the slave performs the actual rasterization.

I think I lack some skill, since I didn't get anything from this. The cache is shared, so instructions and data are mixed in the cache. When could data access occur, apart from MOV? And even if data access occurs only in the leading IF stage, there is 1 IF stage per cycle, so is it such a concern?

Yup. What follows is better explained in the SH programming manual. Basically, the SH pipeline operates like this:

IF ID EX NA WB <- leading IF stage
.. if ID EX NA WB
.. .. IF ID EX NA WB <- leading IF stage
.. .. .. if ID EX NA WB

The leading IF (instruction fetch) stage is where the instruction is actually fetched from cache. The second IF stage is in lower case because nothing is actually fetched from cache. Since the SH instructions are 16-bit words, two of them are fetched simultaneously. The first instruction is sent to the decode unit while the second one is placed in an internal register until the decode unit is free. In other words, groups of two instructions are read from cache every two cycles. Since the cache is shared between instructions and operands, the MA (memory access) stage of the pipeline can contend with the IF stage but not the "if" one. If this happens, the MA stage is given priority. This is illustrated below:

Contention ("-" = stall cycle):

IF ID EX NA WB
.. if ID EX *MA* WB
.. .. IF ID EX NA WB
.. .. .. if ID EX NA WB
.. .. .. .. - IF ID EX NA WB <- Contention between IF & MA
.. .. .. .. - .. if ID EX NA WB

6 cycles

No contention:

IF ID EX MA WB
.. if ID EX NA WB
.. .. IF ID EX NA WB
.. .. .. if ID EX NA WB <- "if" pairs with MA
.. .. .. .. IF ID EX NA WB
.. .. .. .. .. if ID EX NA WB

5 cycles

As you can see, it is very important for MOV instructions (or any other instruction that performs memory access, such as LDS/STS, LDC/STC, MAC, etc.) to be launched in the leading IF stage. Otherwise, this costs a stall cycle, which is something you want to avoid as much as possible in tight inner loops.
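If you want to play with this, the charts boil down to a small C model. It is a sketch under the same simplifications as the charts: instructions are fetched in pairs, only the leading IF of each pair reads the cache, an instruction's MA stage comes 3 cycles after its fetch, and a leading IF that lands on somebody's MA slips by one cycle. The function name and the array-of-flags interface are invented for the example, and other stall sources are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_INSNS = 64, MA_DELAY = 3 };  /* MA comes 3 stages after fetch */

/* uses_mem[i] says whether instruction i really accesses memory in MA.
   Returns the number of stall cycles. If fetch_cycle is non-NULL, it
   receives the 1-based cycle in which each instruction is fetched. */
static int count_fetch_stalls(const bool *uses_mem, size_t n,
                              int *fetch_cycle)
{
    int fetched[MAX_INSNS];
    int cycle = 1, stalls = 0;

    assert(uses_mem != NULL && n <= MAX_INSNS);
    for (size_t i = 0; i < n; i++) {
        if (i % 2 == 0) {            /* leading IF: real cache read */
            bool busy = true;
            while (busy) {
                busy = false;
                for (size_t j = 0; j < i; j++)
                    if (uses_mem[j] && fetched[j] + MA_DELAY == cycle)
                        busy = true; /* an MA owns the cache this cycle */
                if (busy) { cycle++; stalls++; }
            }
        }
        fetched[i] = cycle;          /* odd i: "if", no cache access */
        if (fetch_cycle != NULL)
            fetch_cycle[i] = cycle;
        cycle++;
    }
    return stalls;
}
```

With the memory access in the second instruction of a pair, the fifth instruction is fetched in cycle 6 (one stall); with the memory access in the first instruction, it is fetched in cycle 5 (no stall), which matches the "6 cycles" and "5 cycles" figures.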

/* When brain power fails, brute force prevails! */

pmjobin
Jedi Master Material

Boink!

Posts: 54

32X rendering engines Mar 6, 2007 12:36:45 GMT -5

Post by pmjobin on Mar 6, 2007 12:36:45 GMT -5

BTW, I forgot to mention. If you want to have a good example of what can be done with the 32X 3D wise, have a look at this demo that was made by the Zyrinx/Lemon gurus:

www.pouet.net/prod.php?which=10686

I seriously doubt that it is possible to achieve more than that with the 32X.

edit: One of the Lemon guys (Hannibal) who worked on this demo posted a comment about it on pouet.net. Basically, he said that they're the guys behind Amok & Scorcher for the Saturn (Zyrinx also made Sub Terrania and Redzone for the Genesis). Zyrinx & Lemon comprised former members of The Silents, a legendary demo group.

Last Edit: Mar 6, 2007 18:44:34 GMT -5 by pmjobin

/* When brain power fails, brute force prevails! */

GiGaBiTe
$$$ Donald Trump Status $$$

Posts: 439

32X rendering engines Mar 6, 2007 18:25:36 GMT -5

Post by GiGaBiTe on Mar 6, 2007 18:25:36 GMT -5

pmjobin said:

BTW, I forgot to mention. If you want to have a good example of what can be done with the 32X 3D wise, have a look at this demo that was made by the Zyrinx/Lemon gurus:

www.pouet.net/prod.php?which=10686

I seriously doubt that one could achieve more than that with the 32X.

Well that was most likely the model where both the game and the rendering were done on only the SH2's. I have a feeling that you can do better if you make the 68000 do the game while the 32x does the rendering.

It was a very good demo, but it still needed texture correction so that the textures and faces wouldn't warp when you changed your view.

ob1
Moldy Popcorn

Posts: 29

32X rendering engines Mar 7, 2007 2:41:27 GMT -5

Post by ob1 on Mar 7, 2007 2:41:27 GMT -5

gigabite said:

Well that was most likely the model where both the game and the rendering were done on only the SH2's. I have a feeling that you can do better if you make the 68000 do the game while the 32x does the rendering.

That's what I'd like to do.

jlf65
$$$ Donald Trump Status $$$

Posts: 431

32X rendering engines Mar 7, 2007 16:14:18 GMT -5

Post by jlf65 on Mar 7, 2007 16:14:18 GMT -5

68K: game control and sound
MSH2: rendering
SSH2: geometry and lighting

Don't forget that geometry and lighting... that's a big part of the equation, and probably what the manual means when it says "Within 2 SH2 units, it is normal for the master to control the entire 32X and the slave to restore the computing element inside SH2 and works expecially in numerical computing."

Computing the triangle data is almost as much work as rendering the triangles, which is why modern cards do this in hardware. I feel it's better to have the two SH2's concentrate on one task rather than swap off on alternating frames because it's easier on the cache. One SH2 always has the rendering code in its cache, while the other always has the GTL code in its cache.

At that point, the 32X is behaving as a simple 3D card, with a one pipeline GPU, and a one pipeline GTE.
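One way to picture the hand-off between the two SH2s is a small single-producer, single-consumer queue: the geometry side pushes finished screen-space triangles, the rendering side pops them. The sketch below is plain C with invented names (ScreenTri, TriQueue, queue_push, queue_pop). It only shows the index logic; where the queue lives and how cache coherency between the two CPUs is handled on real hardware is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { QUEUE_SLOTS = 8 };   /* must be a power of two */

/* Hypothetical screen-space triangle passed from geometry to rendering. */
typedef struct {
    int16_t x[3], y[3];
    uint8_t color;
} ScreenTri;

typedef struct {
    ScreenTri slot[QUEUE_SLOTS];
    volatile uint32_t head;  /* written only by the producer (geometry) */
    volatile uint32_t tail;  /* written only by the consumer (rendering) */
} TriQueue;

/* Producer side: returns false when the queue is full. */
static bool queue_push(TriQueue *q, const ScreenTri *t)
{
    assert(q != NULL && t != NULL);
    if (q->head - q->tail == (uint32_t)QUEUE_SLOTS)
        return false;
    q->slot[q->head % QUEUE_SLOTS] = *t;
    q->head = q->head + 1;   /* publish only after the slot is written */
    return true;
}

/* Consumer side: returns false when there is nothing to render. */
static bool queue_pop(TriQueue *q, ScreenTri *out)
{
    assert(q != NULL && out != NULL);
    if (q->head == q->tail)
        return false;
    *out = q->slot[q->tail % QUEUE_SLOTS];
    q->tail = q->tail + 1;
    return true;
}
```

Because each index has exactly one writer, neither side needs a lock; the geometry CPU simply retries when the queue is full and the rendering CPU retries when it is empty.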

As for texture correction, I'd have the GTL SH2 subdivide the triangles when they get "too big" while the GPU SH2 only does affine texture mapping. That is what good games on the PSX and Saturn did as both systems only support affine texture mapping. Once you have it going, you can experiment with the threshold used to subdivide the triangles to avoid warping while minimizing the number of triangles generated.
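A rough sketch of that subdivision step in C. It assumes view-space vertices with z > 0 (no clipping), a simple perspective divide with a made-up focal parameter, and floats for readability; a real SH2 version would use fixed point. Vertex, Tri, midpoint, screen_edge_sq and subdivide are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, u, v; } Vertex;  /* view space + texcoords */
typedef struct { Vertex v[3]; } Tri;

/* Midpoint taken in 3D, so the new vertex gets correct u,v values. */
static Vertex midpoint(Vertex a, Vertex b)
{
    Vertex m = { (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f,
                 (a.u + b.u) * 0.5f, (a.v + b.v) * 0.5f };
    return m;
}

/* Squared on-screen length of an edge after the perspective divide. */
static float screen_edge_sq(Vertex a, Vertex b, float focal)
{
    float dx = a.x * focal / a.z - b.x * focal / b.z;
    float dy = a.y * focal / a.z - b.y * focal / b.z;
    return dx * dx + dy * dy;
}

/* Split t into 4 until every edge is at most max_edge pixels long or
   max_depth is used up. Writes the result to out and returns the count. */
static int subdivide(Tri t, float focal, float max_edge, int max_depth,
                     Tri *out, int capacity)
{
    float limit = max_edge * max_edge;
    int small_enough = screen_edge_sq(t.v[0], t.v[1], focal) <= limit &&
                       screen_edge_sq(t.v[1], t.v[2], focal) <= limit &&
                       screen_edge_sq(t.v[2], t.v[0], focal) <= limit;
    assert(out != NULL);
    if (small_enough || max_depth == 0) {
        if (capacity < 1)
            return 0;            /* output full: dropped in this sketch */
        out[0] = t;
        return 1;
    }
    Vertex m01 = midpoint(t.v[0], t.v[1]);
    Vertex m12 = midpoint(t.v[1], t.v[2]);
    Vertex m20 = midpoint(t.v[2], t.v[0]);
    Tri kids[4] = {
        {{ t.v[0], m01, m20 }},
        {{ m01, t.v[1], m12 }},
        {{ m20, m12, t.v[2] }},
        {{ m01, m12, m20 }},
    };
    int n = 0;
    for (int k = 0; k < 4; k++)
        n += subdivide(kids[k], focal, max_edge, max_depth - 1,
                       out + n, capacity - n);
    return n;
}
```

The max_edge parameter is the threshold to experiment with: a triangle that covers 100x100 pixels becomes 16 small ones with max_edge = 40, and stays a single triangle with max_edge = 200.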

GiGaBiTe
$$$ Donald Trump Status $$$

Posts: 439

32X rendering engines Mar 8, 2007 0:54:22 GMT -5

Post by GiGaBiTe on Mar 8, 2007 0:54:22 GMT -5

I'd maximize the availability of the 68000 for game code and send the control of the YM2612, PSG, PWM and the other sound chips to the Z80.

The thing is though, I don't know if it's possible to control the 32x PWM with the Z80, or if the Z80 even has control of anything on the 32x side of the bus, as it seems to be a slave to both the 68000 and the SH2's.

ob1
Moldy Popcorn

Posts: 29

32X rendering engines Mar 8, 2007 2:44:13 GMT -5

Post by ob1 on Mar 8, 2007 2:44:13 GMT -5

The Hardware Manual states (section 4.3) that the Z80 has access to the 32X. I've rewritten this doc (see the other subject in this forum or my site).

"Even when 32X is mapping in the 68000 address space, 68000 memory area can access each 8000h by switching banks similar to when using the Mega Drive unit."

In particular, the Z80 has access to the PWM and the VDP.
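The bank arithmetic behind "each 8000h" can be sketched in C. It assumes the usual Mega Drive arrangement where the Z80 sees one 8000h-byte slice of the 68000 map through a window at 8000h-FFFFh; the struct and function names are invented, and 0xA15130 is only used as an example 68000-side address, to be checked against the hardware manual before relying on it.

```c
#include <assert.h>
#include <stdint.h>

enum { Z80_WINDOW_BASE = 0x8000, Z80_WINDOW_SIZE = 0x8000 };

typedef struct {
    uint16_t bank;      /* which 8000h-sized slice of the 68000 map */
    uint16_t z80_addr;  /* address the Z80 uses inside its window */
} Z80Mapping;

/* Split a 24-bit 68000 address into a bank number and a Z80 address. */
static Z80Mapping z80_map_68k_address(uint32_t addr68k)
{
    Z80Mapping m;
    assert(addr68k <= 0xFFFFFFu);   /* the 68000 bus is 24 bits wide */
    m.bank = (uint16_t)(addr68k / Z80_WINDOW_SIZE);
    m.z80_addr = (uint16_t)(Z80_WINDOW_BASE + addr68k % Z80_WINDOW_SIZE);
    return m;
}
```

For the example address 0xA15130 this gives bank 0x142 and Z80 address 0xD130: the Z80 would select that bank once, then read and write 0xD130 as if it were local memory.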

jlf65
$$$ Donald Trump Status $$$

Posts: 431

32X rendering engines Mar 8, 2007 15:10:39 GMT -5

Post by jlf65 on Mar 8, 2007 15:10:39 GMT -5

Z80: sound
68K: game control
MSH2: rendering
SSH2: geometry and lighting

;D

If you also had the CD, you could make the secondary 68K do the game control (since it's faster and has more memory), and make the primary 68K do the sound.

GiGaBiTe
$$$ Donald Trump Status $$$

Posts: 439

32X rendering engines Mar 8, 2007 18:58:10 GMT -5

Post by GiGaBiTe on Mar 8, 2007 18:58:10 GMT -5

You could probably get both 68000's to work on game data, but there would probably be a ton of wait states since both CPUs run at different speeds.

ob1
Moldy Popcorn

Posts: 29

32X rendering engines Mar 24, 2007 17:19:11 GMT -5

Post by ob1 on Mar 24, 2007 17:19:11 GMT -5

pmjobin said:

Yup. What follows is better explained in the SH programming manual. Basically, the SH pipeline operates like this: (long explanation of contention)

Some things take time to be understood. Thank you!!!!
