tykkiman's N64 homebrew site

Visibility and overdraw

I’m writing this explainer because I have studied the subject a lot but haven’t yet applied my learnings to anything publicly released, so there’s real risk all of it gets stuck in my head. I’m hoping this post will inspire you to explore this wonderful subproblem of computer graphics and make an awesome N64 homebrew breakthrough of your own :) –tykkiman


The Nintendo 64 is a shared memory system. Anything you do on it will be competing on access to the same bus to access RDRAM. That means RDP drawing pixels to a framebuffer will slow down gameplay code running on the VR4300 CPU, even though intuitively you’d expect them to have no connection.

Especially deadly, performance-wise, is overdraw. It simply means drawing to the same pixel multiple times so that only the last color we wrote stays visible. The earlier writes go to waste without any effect on final image quality. So how do you avoid overdraw, and if you can’t, make it less of a problem?

Avoiding overdraw with culling of different sorts

Here “culling” means removing some objects, parts of objects, or individual triangles from the set of things to draw. Let’s look at them in order of increasing complexity. To be clear, only occlusion culling actually avoids drawing to the same pixel multiple times. Backface and frustum culling just reduce the number of triangles sent to the RDP, but those would’ve been discarded anyway.

Backface culling

Backface culling removes polygons whose visible side points away from camera. On the N64 this is usually done in 3D microcode during triangle setup. On modern systems it’s possible to do hierarchical backface culling by splitting your objects into small “clusters” and precomputing a cone per cluster inside of which all clusters triangles are visible. That could actually be a great fit for the N64 too!

Meshlets

A meshlet experiment. A mesh is split into smaller meshlets that store their own surface normal and visibility cone radius so that they can be frustum culled more efficiently. Screenshot from a mesh viewer running on PC.

Frustum culling

Frustum culling hides things behind the camera and outside of its field of view. The word “frustum” refers to the pyramid-with-a-cut-top shape that represents camera’s field of view volume. Per-triangle frustum culling is automatically performed by the 3D microcode as a side effect of clipping, which is necessary anyway to produce a correct image without cracks.

See this example of a very dense stress test mesh:

Frustum culling. Left: camera view, right: what gets actually drawn.

Usually per-object frustum culling is what gets talked about because it’s much more effective. In practice it means precomputing some bounding shapes, such as spheres or boxes, for your objects and testing those for intersection against each of the six planes of the camera frustum. I’ve written this myself with the help of cglm library’s glm_frustum_planes in extracting the planes and a naive bounding sphere computed for each object. The sphere vs plane test is simple and for example the three.js library uses this method.

If you have a static environment that consists of many individual objects, you could speed up its frustum culling by doing it hierarchically. You can put all objects in a bounding volume hierarchy (BVH) (think putting object bounding boxes inside larger boxes) and then quickly cull away large groups of objects at once.

Occlusion culling

Occlusion culling is complex. It means hiding objects that are behind (“occluded by”) other objects in the scene, from camera’s perspective of course. It’s a hard problem and there isn’t a one-size-fits-all solution to it. Well, nowadays hierarchical Z-culling gets pretty close and is used for example in Unreal Engine 5. But that is a no-go on the N64 because it’s so computationally intensive. We are left with three options:

Portals are easy to author by hand in interior area game levels and is probably the occlusion culling method of choice of all developers used on the N64, if any. Potentially visible sets are pretty much just an optimization on portals and were popular because Quake ended up using them. Still, portals should be easier to code since no analytical visibility computations are needed (just use screenspace bounds) and can cull individual objects more effectively. This is why Thief (1998) could have more objects per room than Quake (1996).

What doesn’t work in occlusion culling

The problem statement “find out which objects are occluded by others” has one important subtlety: objects can be partially covered by many others and still end up fully occluded. If an occlusion culling method does this correctly it’s said to do occluder fusion. Portal rendering, potentially visible sets and hierarchical Z-buffering all perform occluder fusion, which is why they are so useful. Most other past algorithms do not and only operate on pairs of objects when doing visibility checks.

Making overdraw less of a problem

Of course you can write more cache-friendly code so that contention on the Big Memory Bus doesn’t slow down your game that much. That’s easier said than done though, and I don’t claim to know how to do it on the N64. We could try to apply data-oriented design principles here even though they were originally written with Xbox 360 and Playstation 3 in mind. The CPUs in N64 and the Xbox 360 bear some similarities: They both have in-order, pipelined execution and don’t like branch misses. So the same principles, “avoid cache misses and branches”, should apply to both the Xbox 360’s IBM Xenon and the NEC VR4300.

Avoiding the Z-buffer?

If drawing pixels is faster, you can afford more overdraw. Draw objects without Z-buffering and you’ll get a 20% performance boost. So you’ll be solving the visibility problem with the painter’s algorithm instead by drawing polys back to front.

It’s pretty hard to do though because you need to sort every polygon of each visible object, every frame. In theory you could use a binary space partitioning (BSP) tree to draw triangles back to front, but according to my experiments it’s not that simple. It’s easy to end up with too many vertices after splits done when building the tree. Then vertex transforms and lighting become the bottleneck. You can’t batch draws by texture either without a Z-buffer so you’ll end up swapping textures constantly which causes bubbles in the RDP pipeline and there goes your performance wins.

Worst BSP tree ever! Each leaf has a unique color for visualization.

I’m not saying BSPs can’t work on the N64 as a Z-buffer replacement with some care but I’m not there yet. Probably the best tradeoff is to sort each object and then sort each objects tris with a BSP or some other way. Overlapping objects can revert back z-buffering, like Rasky suggested on Discord.

Ordering tables

Way over there in the Playstation side they sorted polygons by Z with an “ordering table”. Conceptually it’s an array of lists. Each polygon is added to an array index corresponding to its average Z coordinate. We maintain an back-to-front order in each list. Then polygons can be drawn back-to-front by drawing every list of every array index in order. You can also sort whole objects instead of individual polygons.

And it doesn’t really work for sorting polys on the N64 as it requires per-polygon Z to be available on the CPU. Vertex transforms are usually done on the RSP so you’d have to transfer screen space transformed vertices back to RDRAM for sorting. And you can’t really afford to keep the linked list in that 4 KiB of DMEM either. But it seems good for object sorting.

More reading

Check out Optimized View Frustum Culling Algorithms for Bounding Boxes for some tricks how to speed up hierarchical frustum culling.

Sean Barret’s The 3D Software Rendering Technology of 1998’s Thief: The Dark Project has some fascinating details of Thief’s BSP renderer and other methods.