Neural Cellular Automata (NCA) are a powerful framework for simulating the evolution of cellular structures over time, in which each cell's state is directly influenced by its neighbors. They have been used in applications such as image generation, texture synthesis, and even physics and biology simulations. However, most existing work in this area has focused on 2D cellular automata or static 3D voxel grids with limited user interaction. In this project, we extend the NCA framework to 3D voxel grids and create a real-time rendering pipeline that allows for dynamic user destruction of the voxel grid. We first convert an input colored triangle mesh into a 3D voxel representation and train a 3D convolutional neural network that learns to create and regenerate this voxel representation from a minimal or damaged voxel grid. The model architecture includes three 3D convolutional layers, a LayerNorm layer, and a pooling layer for dimensionality reduction. The final trained model is visualized with a custom interactive renderer built with VisPy that renders the model output in real time and supports user destruction of the voxel grid with the mouse cursor in order to simulate damage and regeneration.
We used the GREYC 3D colored mesh dataset, which contains 15 different .PLY files. Each vertex of a mesh is represented by 3 coordinates (x, y, z) and an RGB color (r, g, b). A few of the objects included in the dataset are shown below; we chose to work with the Mario, Mario Kart, and Duck meshes.
Since our neural network trains on voxel grids and not triangle meshes, we wrote a script to convert each colored 3D mesh into voxels stored in an .NPY file. The voxelization process starts by normalizing the triangle mesh into voxel-grid space so that it fits within the given resolution x resolution x resolution voxel grid. We then create a blank 3D grid of voxels and iterate through each triangle in the mesh. For each triangle, we compute the voxel bounding box that contains the triangle, loop through each voxel in the bounding box, and use barycentric coordinates to check whether the voxel center lies within the triangle. If it does, we assign the triangle's color to that voxel. When multiple triangles map to the same voxel, we simply keep the color of the largest triangle that covers the voxel center. Here's an example of our voxelization below.
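To make the loop above concrete, here is a minimal sketch of the rasterization under a few assumptions: we load the mesh with `trimesh`, take each face's color as the average of its vertex colors, and treat a voxel center as covered when it lies within half a voxel of the triangle's plane and its projection falls inside the triangle. The tolerance, resolution default, and helper names are illustrative rather than our production script.

```python
import numpy as np
import trimesh  # assumed loader; any colored-.PLY reader works


def tri_covers(center, a, b, c, tol=0.5):
    """Barycentric test: does `center` project inside triangle (a, b, c)
    while lying within `tol` voxels of the triangle's plane?"""
    n = np.cross(b - a, c - a)
    norm = np.linalg.norm(n)
    if norm < 1e-8:                      # degenerate triangle
        return False
    n = n / norm
    dist = np.dot(center - a, n)
    if abs(dist) > tol:
        return False
    p = center - dist * n                # projection onto the triangle plane
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return v >= 0 and w >= 0 and (1 - v - w) >= 0


def voxelize(path, resolution=32):
    mesh = trimesh.load(path)
    verts = np.asarray(mesh.vertices, dtype=np.float64)
    verts -= verts.min(axis=0)
    verts *= (resolution - 1e-3) / verts.max()   # normalize into voxel-grid space

    filled = np.zeros((resolution,) * 3, dtype=bool)
    colors = np.zeros((resolution,) * 3 + (3,), dtype=np.uint8)
    best_area = np.zeros((resolution,) * 3)

    # Per-face color = mean of the face's vertex colors (the meshes are vertex-colored).
    face_colors = mesh.visual.vertex_colors[mesh.faces][:, :, :3].mean(axis=1)

    for face, color in zip(mesh.faces, face_colors):
        a, b, c = verts[face]
        area = 0.5 * np.linalg.norm(np.cross(b - a, c - a))
        # Voxel bounding box of the triangle, clipped to the grid.
        lo = np.maximum(np.floor(np.minimum.reduce([a, b, c])).astype(int), 0)
        hi = np.minimum(np.ceil(np.maximum.reduce([a, b, c])).astype(int) + 1, resolution)
        for x in range(lo[0], hi[0]):
            for y in range(lo[1], hi[1]):
                for z in range(lo[2], hi[2]):
                    center = np.array([x, y, z]) + 0.5
                    # When several triangles cover a voxel, the largest one wins.
                    if area > best_area[x, y, z] and tri_covers(center, a, b, c):
                        filled[x, y, z] = True
                        best_area[x, y, z] = area
                        colors[x, y, z] = color
    return filled, colors
```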
We also implemented a simple `FloodFill` algorithm to fill in the empty voxels inside the voxel object. The flood fill starts at an exterior boundary voxel and uses BFS to find all connected voxels that are not already filled (essentially finding the air outside the object). We then take the inverse of these "air" voxels and the filled voxels with `inside_filled = ~flood_fill & ~filled` to identify the empty voxels inside the object, and we assign these inside voxels a flesh-toned pink color of `(255, 200, 200)`.
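A compact version of this flood fill, assuming the boolean occupancy grid `filled` from the voxelization step and that the corner voxel `(0, 0, 0)` lies outside the object:

```python
from collections import deque

import numpy as np


def fill_interior(filled):
    """Return a mask of empty voxels enclosed by the object's surface."""
    res = filled.shape[0]
    flood = np.zeros_like(filled)        # "air" voxels reachable from outside
    flood[0, 0, 0] = True
    queue = deque([(0, 0, 0)])
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                           (0, -1, 0), (0, 0, 1), (0, 0, -1)):
            nx, ny, nz = x + dx, y + dy, z + dz
            if (0 <= nx < res and 0 <= ny < res and 0 <= nz < res
                    and not filled[nx, ny, nz] and not flood[nx, ny, nz]):
                flood[nx, ny, nz] = True
                queue.append((nx, ny, nz))
    # Everything that is neither outside air nor surface is interior.
    return ~flood & ~filled
```

The returned mask can then be used to mark those voxels as filled and assign them the pink color, e.g. `colors[inside] = (255, 200, 200)`.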
Each voxel's state consists of 16 channels: the first 4 correspond, in order, to RGBA values. The other 12 can be thought of as "hidden states" that convey information to their neighbors on each update. The model is built on three 3D convolutions. The intuition behind the architecture is to first perceive the surroundings, pooling information from the 3x3x3 grid of neighboring voxels. Then, after a LayerNorm (for regularization purposes), we process the pooled information with layers of kernel size 1, eventually shrinking the dimensionality to our desired output.
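A minimal PyTorch sketch of this architecture is shown below. Only the 16 state channels and the kernel-size pattern (3, then 1, then 1) come from the description above; the hidden width of 64, the ReLU, and the residual state update are assumptions for illustration.

```python
import torch
import torch.nn as nn

CHANNELS = 16   # 4 visible (RGBA) + 12 hidden channels
HIDDEN = 64     # assumed width of the intermediate layers


class VoxelNCA(nn.Module):
    def __init__(self, channels=CHANNELS, hidden=HIDDEN):
        super().__init__()
        # Perceive: pool information from each voxel's 3x3x3 neighborhood.
        self.perceive = nn.Conv3d(channels, hidden, kernel_size=3, padding=1)
        # Normalize the pooled features (regularization).
        self.norm = nn.LayerNorm(hidden)
        # Process per voxel with kernel-size-1 convolutions, shrinking back
        # down to one update per state channel.
        self.fc1 = nn.Conv3d(hidden, hidden, kernel_size=1)
        self.fc2 = nn.Conv3d(hidden, channels, kernel_size=1)

    def forward(self, x):                  # x: (B, 16, X, Y, Z)
        y = self.perceive(x)
        # LayerNorm acts on the channel dimension: move channels last and back.
        y = self.norm(y.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        y = torch.relu(self.fc1(y))
        dx = self.fc2(y)
        return x + dx                      # residual update of the voxel state
```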
Initially, we trained our model to learn to grow: it starts from a single black voxel (with a learned hidden state), and we optimize it to construct the full voxel grid within 16-64 iterations (the number of iterations is sampled uniformly). The model learns to do this relatively quickly, but it does not learn to maintain the voxel grid; within a few more iterations, the grid often degenerates into chaos. So, in the next stage of training, we start from the voxel grid created by the model and optimize it to maintain that grid. This way, the model learns to grow our voxel grid and keep it stable. Now for the most interesting part: we made our voxel grid resilient to damage. This stage of training consists of randomly corrupting portions of the voxel grid and training our model to reconstruct those portions, resulting in a dynamic, living 3D object.
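The damage stage can be illustrated with a helper that wipes a random region of the state grid before a training rollout; the spherical shape and radius range here are illustrative choices, not necessarily the exact corruption we used.

```python
import torch


def damage(state, min_radius=4, max_radius=10):
    """Zero out a random sphere of voxels in a (B, C, X, Y, Z) state tensor."""
    _, _, X, Y, Z = state.shape
    for i in range(state.shape[0]):
        r = torch.randint(min_radius, max_radius + 1, (1,)).item()
        cx, cy, cz = (torch.randint(0, d, (1,)).item() for d in (X, Y, Z))
        xs = torch.arange(X).view(-1, 1, 1)
        ys = torch.arange(Y).view(1, -1, 1)
        zs = torch.arange(Z).view(1, 1, -1)
        mask = (xs - cx) ** 2 + (ys - cy) ** 2 + (zs - cz) ** 2 <= r ** 2
        state[i][:, mask] = 0.0            # wipe visible and hidden channels alike
    return state
```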
We built a curriculum to manage all of these learning tasks while preventing catastrophic forgetting: each curriculum stage adds 64 iterations to the previous one, i.e. 0->64, 64->128, and so on up to 1024.
Our model's loss function consists of three terms: undergrowth, overgrowth, and stability. The stability weight is on a linear schedule, since we want the model to focus on learning to grow initially (weights: undergrowth at 1, overgrowth at 10, and stability ramping from 0 to 10).
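One way to write this down, assuming the model's RGBA output and the voxelized target are both `(X, Y, Z, 4)` tensors; the precise definition of each term below is our illustrative reading of the scheme, not verbatim training code.

```python
import torch
import torch.nn.functional as F


def nca_loss(pred_rgba, prev_rgba, target_rgba, progress,
             w_under=1.0, w_over=10.0, w_stab_max=10.0):
    """Weighted sum of undergrowth, overgrowth, and stability terms.

    `progress` is the fraction of training completed (0 to 1), used for the
    linear schedule on the stability weight.
    """
    occupied = target_rgba[..., 3] > 0.5          # voxels the object should fill

    # Undergrowth: the object's own voxels must match the target colors/alpha.
    undergrowth = F.mse_loss(pred_rgba[occupied], target_rgba[occupied])

    # Overgrowth: nothing should be alive outside the target shape.
    overgrowth = pred_rgba[~occupied][:, 3].clamp(min=0).mean()

    # Stability: successive states should stop changing once the shape is grown.
    stability = F.mse_loss(pred_rgba, prev_rgba)

    w_stab = w_stab_max * progress                # ramps from 0 to 10 over training
    return w_under * undergrowth + w_over * overgrowth + w_stab * stability
```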
Once the model has stabilized, we can visualize our NCA with a custom interactive GUI built with VisPy and PyQt. VisPy is a high-performance Python library powered by OpenGL, ideal for rendering large 2D and 3D visualizations like voxel grids. Its compatibility with PyTorch and PyQt made it well-suited for integrating real-time model inference with an interactive GUI. To get the interactive renderer working, we had to implement several key components, including voxel rendering, camera control, and mouse-based interaction.
First, we set up a VisPy canvas with a turntable camera to allow for interactive zooming and rotation. Next, we loaded our PyTorch model from the trained checkpoint and set up a simulation function that runs model inference at every time step and outputs a 4D NumPy array of shape `(X, Y, Z, 4)`.
`X, Y, Z` are the spatial dimensions of the voxel grid and `4` is the number of color channels `R, G, B, A`, where `A` is the alpha channel that determines the opacity of the color. To determine which voxels were "alive" at each time step, we used a simple thresholding method: a voxel is alive if its alpha channel is above a certain "alive threshold" value, and we only grab the `R, G, B` colors of the alive voxels.
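In code, the alive-voxel extraction is a single mask over the alpha channel (the threshold value of 0.1 is illustrative):

```python
import numpy as np

ALIVE_THRESHOLD = 0.1   # illustrative value


def extract_alive(grid):
    """grid: (X, Y, Z, 4) RGBA NumPy array produced by the simulation step."""
    alive = grid[..., 3] > ALIVE_THRESHOLD   # (X, Y, Z) boolean mask
    coords = np.argwhere(alive)              # (N, 3) voxel indices
    colors = grid[alive][:, :3]              # (N, 3) RGB of the alive voxels
    return coords, colors
```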
Our first approach rendered the alive voxels as a point cloud using VisPy's `Markers` visual. Although it was simple and easy to implement, the point cloud wasn't up to par with the desired rendering quality, as pictured below. Next, we rendered each voxel as an individual `Box` object, but rendering \(32^3\) individual cubes created a lot of lag. To improve the rendering speed, we decided to batch all the voxels together into a single "mesh" and use the `MeshVisual` class, updating the mesh data at each time step. This allowed us to create a very fast and visually appealing rendering (running locally on the CPU) while still maintaining the cube look, as shown below:
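A sketch of the batching step: each alive voxel contributes one cube (8 vertices, 12 triangles) to a single buffer, which is pushed to the `MeshVisual` with `set_data` every frame. The cube template and winding order here are illustrative, and colors are assumed to already be floats in [0, 1].

```python
import numpy as np

# Unit-cube template: 8 corners and 12 triangles (two per face).
CUBE_VERTS = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                       [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=np.float32)
CUBE_FACES = np.array([[0, 1, 2], [0, 2, 3], [4, 6, 5], [4, 7, 6],
                       [0, 4, 5], [0, 5, 1], [3, 2, 6], [3, 6, 7],
                       [0, 3, 7], [0, 7, 4], [1, 5, 6], [1, 6, 2]])


def build_voxel_mesh(coords, colors):
    """Batch one cube per alive voxel into shared vertex/face/color buffers."""
    n = len(coords)
    verts = (CUBE_VERTS[None] + coords[:, None, :]).reshape(-1, 3)          # (8n, 3)
    faces = (CUBE_FACES[None] + 8 * np.arange(n)[:, None, None]).reshape(-1, 3)
    vert_colors = np.repeat(colors, 8, axis=0)                              # one color per corner
    return verts, faces, vert_colors


def update_frame(mesh, grid):
    """Each frame: rebuild the buffers and update the VisPy mesh visual in place."""
    coords, colors = extract_alive(grid)             # from the snippet above
    verts, faces, vert_colors = build_voxel_mesh(coords, colors)
    mesh.set_data(vertices=verts, faces=faces, vertex_colors=vert_colors)
```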
We also configured the `MeshVisual` voxel object so that all faces of the object were uniformly lit. This caused the coloring of some voxels to look overly saturated, so we dialed down the saturation by converting the RGB colors to HSV, reducing the saturation, and converting back to RGB.
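The desaturation step is a quick round trip through HSV, for example with Matplotlib's color utilities (the 0.7 scaling factor is illustrative):

```python
from matplotlib.colors import hsv_to_rgb, rgb_to_hsv


def desaturate(rgb, factor=0.7):
    """rgb: (N, 3) float array in [0, 1]; returns the same colors, less saturated."""
    hsv = rgb_to_hsv(rgb)
    hsv[:, 1] *= factor        # scale down only the saturation channel
    return hsv_to_rgb(hsv)
```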
To support mouse-based interaction, we needed to cast a ray from the mouse cursor into the 3D scene. We used the `view.scene.transform` object, which represents the current mapping between scene and screen coordinates in VisPy, and leveraged the inverse transformation `view.scene.transform.imap` to transform points from screen space to world space.
We take the `(x, y)` mouse position in screen coordinates and create two homogeneous coordinates representing a near point and a far point along the viewing z-axis, where 0 = near and 1 = far:
\[p_{near} = (x, y, 0, 1), \quad p_{far} = (x, y, 1, 1)\]
We apply the `imap` inverse transformation to both points to get the 3D coordinates of the near and far points in world coordinates, and set the ray origin to be the near point and the direction to be the normalized difference between the far and near points.
\[\mathrm{ray}_{origin} = \mathrm{imap}(p_{near})[:3], \quad \mathrm{ray}_{direction} = \frac{\mathrm{imap}(p_{far})[:3] - \mathrm{imap}(p_{near})[:3]}{\lVert \mathrm{imap}(p_{far})[:3] - \mathrm{imap}(p_{near})[:3] \rVert}\]
We tested our transformation by drawing the resulting ray in 3D space, and it correctly aligned with our mouse clicks.
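Putting the two formulas together, the screen-to-world ray can be computed directly from the transform; `view` is the scene viewbox used elsewhere in the renderer, and the divide by the homogeneous `w` component is a guard for perspective cameras (it is a no-op for an orthographic one).

```python
import numpy as np


def screen_ray(view, x, y):
    """Cast a world-space ray through the (x, y) mouse position on the canvas."""
    tr = view.scene.transform                 # scene <-> screen mapping
    p_near = np.asarray(tr.imap([x, y, 0, 1]), dtype=float)
    p_far = np.asarray(tr.imap([x, y, 1, 1]), dtype=float)
    p_near = p_near[:3] / p_near[3]
    p_far = p_far[:3] / p_far[3]
    direction = p_far - p_near
    return p_near, direction / np.linalg.norm(direction)
```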
Finally, we used PyQt to improve the user experience of the interactive renderer. We added a button to toggle between different models to visualize different objects at different resolutions, and we also added play, pause, and reset buttons to control the simulation. The play button starts the simulation and runs model inference at each time step, the pause button stops the simulation and allows the user to interact with the voxel grid, and the reset button resets the voxel grid to its original state and stops the simulation.
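A minimal sketch of how the VisPy canvas and the PyQt buttons can be wired together; the layout, timer interval, and `step`/`reset_state` bodies are placeholders rather than our exact GUI code.

```python
from PyQt5 import QtWidgets
from vispy import app, scene

app.use_app('pyqt5')                       # let VisPy and Qt share one event loop
qt_app = QtWidgets.QApplication([])

canvas = scene.SceneCanvas(keys='interactive')
view = canvas.central_widget.add_view()
view.camera = 'turntable'


def step(event):
    pass                                   # placeholder: run one inference step, update the mesh


def reset_state():
    pass                                   # placeholder: restore the seed voxel grid


timer = app.Timer(interval=1 / 30, connect=step, start=False)

# Embed the VisPy canvas in a Qt window and add the control buttons below it.
window = QtWidgets.QWidget()
layout = QtWidgets.QVBoxLayout(window)
layout.addWidget(canvas.native)
buttons = QtWidgets.QHBoxLayout()
play, pause, reset = (QtWidgets.QPushButton(t) for t in ('Play', 'Pause', 'Reset'))
for b in (play, pause, reset):
    buttons.addWidget(b)
layout.addLayout(buttons)

play.clicked.connect(lambda: timer.start())
pause.clicked.connect(lambda: timer.stop())
reset.clicked.connect(lambda: (timer.stop(), reset_state()))

window.show()
qt_app.exec_()
```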
| Method | Time (ms) | Speedup vs. Baseline |
|---|---|---|
| Baseline | 184 | 1.00× |
| INT8 Quantization | 651.5 | 0.28× |
| FP16 Quantization | 42.8 | 4.29× |