Neural Cellular Automata (NCA) are a powerful framework for simulating the evolution of cellular structures over time, in which each cell's state is directly influenced by its neighbors. They have been used in applications such as image generation, texture synthesis, and even physics and biology simulations. However, most existing work in this area has focused on 2D cellular automata or static 3D voxel grids with limited user interaction. In this project, we extend the NCA framework to 3D voxel grids and create a real-time rendering pipeline that allows for dynamic user destruction of the voxel grid. We first convert an input colored triangle mesh into a 3D voxel representation and train a 3D convolutional neural network that learns to create and regenerate this voxel representation from a minimal or damaged voxel grid. The model architecture includes three 3D convolutional layers, a LayerNorm layer, and a pooling layer for dimensionality reduction. The final trained model is visualized with a custom interactive renderer built with VisPy that renders the model output in real time and supports user destruction of the voxel grid with the mouse cursor in order to simulate damage and regeneration.
We used the GREYC 3D colored mesh dataset, which contains 15 different .PLY files. Each vertex of a mesh is represented by 3 coordinates (x, y, z) and an RGB color (r, g, b). A few of the objects included in the dataset are shown below; we chose to work with the Mario, Mario Kart, and Duck meshes.
Since our neural network trains on voxel grids and not triangle meshes, we wrote a script to convert each colored 3D mesh into voxels stored in an .NPY file. The voxelization process starts by normalizing the triangle mesh into voxel-grid space so that it fits within the given resolution x resolution x resolution voxel grid. We then create a blank 3D grid of voxels and iterate through each triangle in the mesh. For each triangle, we compute the voxel bounding box that contains the triangle, loop through each voxel in the bounding box, and use barycentric coordinates to check whether the voxel center lies within the triangle. If it does, we assign the triangle's color to that voxel. When multiple triangles map to the same voxel, we simply keep the color of the largest triangle that covers the voxel center. Here's an example of our voxelization below.
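To make the loop above concrete, here is a minimal sketch of the rasterization under a few assumptions: we load the mesh with `trimesh`, take each face's color as the average of its vertex colors, and treat a voxel center as covered when it lies within half a voxel of the triangle's plane and its projection falls inside the triangle. The tolerance, resolution default, and helper names are illustrative rather than our production script.

```python
import numpy as np
import trimesh  # assumed loader; any colored-.PLY reader works


def tri_covers(center, a, b, c, tol=0.5):
    """Barycentric test: does `center` project inside triangle (a, b, c)
    while lying within `tol` voxels of the triangle's plane?"""
    n = np.cross(b - a, c - a)
    norm = np.linalg.norm(n)
    if norm < 1e-8:                      # degenerate triangle
        return False
    n = n / norm
    dist = np.dot(center - a, n)
    if abs(dist) > tol:
        return False
    p = center - dist * n                # projection onto the triangle plane
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return v >= 0 and w >= 0 and (1 - v - w) >= 0


def voxelize(path, resolution=32):
    mesh = trimesh.load(path)
    verts = np.asarray(mesh.vertices, dtype=np.float64)
    verts -= verts.min(axis=0)
    verts *= (resolution - 1e-3) / verts.max()   # normalize into voxel-grid space

    filled = np.zeros((resolution,) * 3, dtype=bool)
    colors = np.zeros((resolution,) * 3 + (3,), dtype=np.uint8)
    best_area = np.zeros((resolution,) * 3)

    # Per-face color = mean of the face's vertex colors (the meshes are vertex-colored).
    face_colors = mesh.visual.vertex_colors[mesh.faces][:, :, :3].mean(axis=1)

    for face, color in zip(mesh.faces, face_colors):
        a, b, c = verts[face]
        area = 0.5 * np.linalg.norm(np.cross(b - a, c - a))
        # Voxel bounding box of the triangle, clipped to the grid.
        lo = np.maximum(np.floor(np.minimum.reduce([a, b, c])).astype(int), 0)
        hi = np.minimum(np.ceil(np.maximum.reduce([a, b, c])).astype(int) + 1, resolution)
        for x in range(lo[0], hi[0]):
            for y in range(lo[1], hi[1]):
                for z in range(lo[2], hi[2]):
                    center = np.array([x, y, z]) + 0.5
                    # When several triangles cover a voxel, the largest one wins.
                    if area > best_area[x, y, z] and tri_covers(center, a, b, c):
                        filled[x, y, z] = True
                        best_area[x, y, z] = area
                        colors[x, y, z] = color
    return filled, colors
```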
We also implemented a simple `FloodFill` algorithm to fill in the empty voxels inside the voxel object. The flood fill starts at an exterior boundary voxel and uses BFS to find all connected voxels that are not already filled (essentially finding the air outside the object). We then take the inverse of these "air" voxels and the filled voxels with `inside_filled = ~flood_fill & ~filled` to identify the empty voxels inside the object, and we assign these inside voxels a flesh-toned pink color of `(255, 200, 200)`.
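A compact version of this flood fill, assuming the boolean occupancy grid `filled` from the voxelization step and that the corner voxel `(0, 0, 0)` lies outside the object:

```python
from collections import deque

import numpy as np


def fill_interior(filled):
    """Return a mask of empty voxels enclosed by the object's surface."""
    res = filled.shape[0]
    flood = np.zeros_like(filled)        # "air" voxels reachable from outside
    flood[0, 0, 0] = True
    queue = deque([(0, 0, 0)])
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                           (0, -1, 0), (0, 0, 1), (0, 0, -1)):
            nx, ny, nz = x + dx, y + dy, z + dz
            if (0 <= nx < res and 0 <= ny < res and 0 <= nz < res
                    and not filled[nx, ny, nz] and not flood[nx, ny, nz]):
                flood[nx, ny, nz] = True
                queue.append((nx, ny, nz))
    # Everything that is neither outside air nor surface is interior.
    return ~flood & ~filled
```

The returned mask can then be used to mark those voxels as filled and assign them the pink color, e.g. `colors[inside] = (255, 200, 200)`.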
Each voxel's state consists of 16 channels: the first 4 correspond, in order, to RGBA values. The other 12 can be thought of as "hidden states" that convey information to their neighbors on each update. The model is built on three 3D convolutions. The intuition behind the architecture is to first perceive the surroundings, pooling information from the 3x3x3 grid of neighboring voxels. Then, after a LayerNorm (for regularization purposes), we process the pooled information with layers of kernel size 1, eventually shrinking the dimensionality to our desired output.
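A minimal PyTorch sketch of this architecture is shown below. Only the 16 state channels and the kernel-size pattern (3, then 1, then 1) come from the description above; the hidden width of 64, the ReLU, and the residual state update are assumptions for illustration.

```python
import torch
import torch.nn as nn

CHANNELS = 16   # 4 visible (RGBA) + 12 hidden channels
HIDDEN = 64     # assumed width of the intermediate layers


class VoxelNCA(nn.Module):
    def __init__(self, channels=CHANNELS, hidden=HIDDEN):
        super().__init__()
        # Perceive: pool information from each voxel's 3x3x3 neighborhood.
        self.perceive = nn.Conv3d(channels, hidden, kernel_size=3, padding=1)
        # Normalize the pooled features (regularization).
        self.norm = nn.LayerNorm(hidden)
        # Process per voxel with kernel-size-1 convolutions, shrinking back
        # down to one update per state channel.
        self.fc1 = nn.Conv3d(hidden, hidden, kernel_size=1)
        self.fc2 = nn.Conv3d(hidden, channels, kernel_size=1)

    def forward(self, x):                  # x: (B, 16, X, Y, Z)
        y = self.perceive(x)
        # LayerNorm acts on the channel dimension: move channels last and back.
        y = self.norm(y.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        y = torch.relu(self.fc1(y))
        dx = self.fc2(y)
        return x + dx                      # residual update of the voxel state
```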
Initially, we trained our model to learn to grow: it starts from a single black voxel (with a learned hidden state), and we optimize it to construct the full voxel grid within 16-64 iterations (the number of iterations is sampled uniformly). The model learns to do this relatively quickly, but it does not learn to maintain the voxel grid; within a few more iterations, the grid often degenerates into chaos. So, in the next stage of training, we start from the voxel grid created by the model and optimize it to maintain that grid. This way, the model learns to grow our voxel grid and keep it stable. Now for the most interesting part: we made our voxel grid resilient to damage. This stage of training consists of randomly corrupting portions of the voxel grid and training our model to reconstruct those portions, resulting in a dynamic, living 3D object.
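The damage stage can be illustrated with a helper that wipes a random region of the state grid before a training rollout; the spherical shape and radius range here are illustrative choices, not necessarily the exact corruption we used.

```python
import torch


def damage(state, min_radius=4, max_radius=10):
    """Zero out a random sphere of voxels in a (B, C, X, Y, Z) state tensor."""
    _, _, X, Y, Z = state.shape
    for i in range(state.shape[0]):
        r = torch.randint(min_radius, max_radius + 1, (1,)).item()
        cx, cy, cz = (torch.randint(0, d, (1,)).item() for d in (X, Y, Z))
        xs = torch.arange(X).view(-1, 1, 1)
        ys = torch.arange(Y).view(1, -1, 1)
        zs = torch.arange(Z).view(1, 1, -1)
        mask = (xs - cx) ** 2 + (ys - cy) ** 2 + (zs - cz) ** 2 <= r ** 2
        state[i][:, mask] = 0.0            # wipe visible and hidden channels alike
    return state
```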
We built a curriculum to manage all of these learning tasks while preventing catastrophic forgetting: each curriculum stage adds 64 iterations to the previous one, i.e. 0->64, 64->128, and so on up to 1024.
Our model's loss function consists of three terms: undergrowth, overgrowth, and stability. The stability weight is on a linear schedule, since we want the model to focus on learning to grow initially (weights: undergrowth at 1, overgrowth at 10, and stability ramping from 0 to 10).
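One way to write this down, assuming the model's RGBA output and the voxelized target are both `(X, Y, Z, 4)` tensors; the precise definition of each term below is our illustrative reading of the scheme, not verbatim training code.

```python
import torch
import torch.nn.functional as F


def nca_loss(pred_rgba, prev_rgba, target_rgba, progress,
             w_under=1.0, w_over=10.0, w_stab_max=10.0):
    """Weighted sum of undergrowth, overgrowth, and stability terms.

    `progress` is the fraction of training completed (0 to 1), used for the
    linear schedule on the stability weight.
    """
    occupied = target_rgba[..., 3] > 0.5          # voxels the object should fill

    # Undergrowth: the object's own voxels must match the target colors/alpha.
    undergrowth = F.mse_loss(pred_rgba[occupied], target_rgba[occupied])

    # Overgrowth: nothing should be alive outside the target shape.
    overgrowth = pred_rgba[~occupied][:, 3].clamp(min=0).mean()

    # Stability: successive states should stop changing once the shape is grown.
    stability = F.mse_loss(pred_rgba, prev_rgba)

    w_stab = w_stab_max * progress                # ramps from 0 to 10 over training
    return w_under * undergrowth + w_over * overgrowth + w_stab * stability
```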
Once the model has stabilized, we can visualize our NCA with a custom interactive GUI built with VisPy and PyQt. VisPy is a high-performance Python library powered by OpenGL, ideal for rendering large 2D and 3D visualizations like voxel grids. Its compatibility with PyTorch and PyQt made it well-suited for integrating real-time model inference with an interactive GUI. To get the interactive renderer working, we had to implement several key components, including voxel rendering, camera control, and mouse-based interaction.
First, we set up a VisPy canvas with a turntable camera to allow for interactive zooming and rotation. Next, we loaded our PyTorch model from the trained checkpoint and set up a simulation function that runs model inference at every time step and outputs a 4D NumPy array of shape `(X, Y, Z, 4)`.
`X, Y, Z` are the spatial dimensions of the voxel grid and `4` is the number of color channels `R, G, B, A`, where `A` is the alpha channel that determines the opacity of the color. To determine which voxels were "alive" at each time step, we used a simple thresholding method: a voxel is alive if its alpha channel is above a certain "alive threshold" value, and we only grab the `R, G, B` colors of the alive voxels.
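In code, the alive-voxel extraction is a single mask over the alpha channel (the threshold value of 0.1 is illustrative):

```python
import numpy as np

ALIVE_THRESHOLD = 0.1   # illustrative value


def extract_alive(grid):
    """grid: (X, Y, Z, 4) RGBA NumPy array produced by the simulation step."""
    alive = grid[..., 3] > ALIVE_THRESHOLD   # (X, Y, Z) boolean mask
    coords = np.argwhere(alive)              # (N, 3) voxel indices
    colors = grid[alive][:, :3]              # (N, 3) RGB of the alive voxels
    return coords, colors
```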
Our first approach rendered the alive voxels as a point cloud using VisPy's `Markers` visual. Although it was simple and easy to implement, the point cloud wasn't up to par with the desired rendering quality, as pictured below. Next, we rendered each voxel as an individual `Box` object, but rendering \(32^3\) individual cubes created a lot of lag. To improve the rendering speed, we decided to batch all the voxels together into a single "mesh" and use the `MeshVisual` class, updating the mesh data at each time step. This allowed us to create a very fast and visually appealing rendering (running locally on the CPU) while still maintaining the cube look, as shown below:
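A sketch of the batching step: each alive voxel contributes one cube (8 vertices, 12 triangles) to a single buffer, which is pushed to the `MeshVisual` with `set_data` every frame. The cube template and winding order here are illustrative, and colors are assumed to already be floats in [0, 1].

```python
import numpy as np

# Unit-cube template: 8 corners and 12 triangles (two per face).
CUBE_VERTS = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                       [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=np.float32)
CUBE_FACES = np.array([[0, 1, 2], [0, 2, 3], [4, 6, 5], [4, 7, 6],
                       [0, 4, 5], [0, 5, 1], [3, 2, 6], [3, 6, 7],
                       [0, 3, 7], [0, 7, 4], [1, 5, 6], [1, 6, 2]])


def build_voxel_mesh(coords, colors):
    """Batch one cube per alive voxel into shared vertex/face/color buffers."""
    n = len(coords)
    verts = (CUBE_VERTS[None] + coords[:, None, :]).reshape(-1, 3)          # (8n, 3)
    faces = (CUBE_FACES[None] + 8 * np.arange(n)[:, None, None]).reshape(-1, 3)
    vert_colors = np.repeat(colors, 8, axis=0)                              # one color per corner
    return verts, faces, vert_colors


def update_frame(mesh, grid):
    """Each frame: rebuild the buffers and update the VisPy mesh visual in place."""
    coords, colors = extract_alive(grid)             # from the snippet above
    verts, faces, vert_colors = build_voxel_mesh(coords, colors)
    mesh.set_data(vertices=verts, faces=faces, vertex_colors=vert_colors)
```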
We also configured the `MeshVisual` voxel object so that all faces of the object were uniformly lit. This caused the coloring of some voxels to look overly saturated, so we dialed down the saturation by converting the RGB colors to HSV, reducing the saturation, and converting back to RGB.
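The desaturation step is a quick round trip through HSV, for example with Matplotlib's color utilities (the 0.7 scaling factor is illustrative):

```python
from matplotlib.colors import hsv_to_rgb, rgb_to_hsv


def desaturate(rgb, factor=0.7):
    """rgb: (N, 3) float array in [0, 1]; returns the same colors, less saturated."""
    hsv = rgb_to_hsv(rgb)
    hsv[:, 1] *= factor        # scale down only the saturation channel
    return hsv_to_rgb(hsv)
```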
To support mouse-based interaction, we needed to cast a ray from the mouse cursor into the 3D scene. We used the `view.scene.transform` object, which represents the current mapping between scene and screen coordinates in VisPy, and leveraged the inverse transformation `view.scene.transform.imap` to transform points from screen space to world space.
We take the `(x, y)` mouse position in screen coordinates and create two homogeneous coordinates representing a near point and a far point along the viewing z-axis, where 0 = near and 1 = far:
\[p_{near} = (x, y, 0, 1), \quad p_{far} = (x, y, 1, 1)\]
We apply the `imap` inverse transformation to both points to get the 3D coordinates of the near and far points in world coordinates, and set the ray origin to be the near point and the direction to be the normalized difference between the far and near points.
\[\mathrm{ray}_{origin} = \mathrm{imap}(p_{near})[:3], \quad \mathrm{ray}_{direction} = \frac{\mathrm{imap}(p_{far})[:3] - \mathrm{imap}(p_{near})[:3]}{\lVert \mathrm{imap}(p_{far})[:3] - \mathrm{imap}(p_{near})[:3] \rVert}\]
We tested our transformation by drawing the resulting ray in 3D space, and it correctly aligned with our mouse clicks.
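Putting the two formulas together, the screen-to-world ray can be computed directly from the transform; `view` is the scene viewbox used elsewhere in the renderer, and the divide by the homogeneous `w` component is a guard for perspective cameras (it is a no-op for an orthographic one).

```python
import numpy as np


def screen_ray(view, x, y):
    """Cast a world-space ray through the (x, y) mouse position on the canvas."""
    tr = view.scene.transform                 # scene <-> screen mapping
    p_near = np.asarray(tr.imap([x, y, 0, 1]), dtype=float)
    p_far = np.asarray(tr.imap([x, y, 1, 1]), dtype=float)
    p_near = p_near[:3] / p_near[3]
    p_far = p_far[:3] / p_far[3]
    direction = p_far - p_near
    return p_near, direction / np.linalg.norm(direction)
```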
Finally, we used PyQt to improve the user experience of the interactive renderer. We added a button to toggle between different models to visualize different objects at different resolutions, and we also added play, pause, and reset buttons to control the simulation. The play button starts the simulation and runs model inference at each time step, the pause button stops the simulation and allows the user to interact with the voxel grid, and the reset button resets the voxel grid to its original state and stops the simulation.
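A minimal sketch of how the VisPy canvas and the PyQt buttons can be wired together; the layout, timer interval, and `step`/`reset_state` bodies are placeholders rather than our exact GUI code.

```python
from PyQt5 import QtWidgets
from vispy import app, scene

app.use_app('pyqt5')                       # let VisPy and Qt share one event loop
qt_app = QtWidgets.QApplication([])

canvas = scene.SceneCanvas(keys='interactive')
view = canvas.central_widget.add_view()
view.camera = 'turntable'


def step(event):
    pass                                   # placeholder: run one inference step, update the mesh


def reset_state():
    pass                                   # placeholder: restore the seed voxel grid


timer = app.Timer(interval=1 / 30, connect=step, start=False)

# Embed the VisPy canvas in a Qt window and add the control buttons below it.
window = QtWidgets.QWidget()
layout = QtWidgets.QVBoxLayout(window)
layout.addWidget(canvas.native)
buttons = QtWidgets.QHBoxLayout()
play, pause, reset = (QtWidgets.QPushButton(t) for t in ('Play', 'Pause', 'Reset'))
for b in (play, pause, reset):
    buttons.addWidget(b)
layout.addLayout(buttons)

play.clicked.connect(lambda: timer.start())
pause.clicked.connect(lambda: timer.stop())
reset.clicked.connect(lambda: (timer.stop(), reset_state()))

window.show()
qt_app.exec_()
```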
| Method | Time (ms) | Speedup vs. Baseline |
|---|---|---|
| Baseline | 184 | 1.00× |
| INT8 Quantization | 651.5 | 0.28× |
| FP16 Quantization | 42.8 | 4.29× |