In case you missed the last big update on Discord:
Hello @here, today I will give a progress update on our custom rendering engine CRP.
The rework is done, and the engine is now much more future-proof than it has ever been. It currently runs on Vulkan, which allows for cross-platform support (Windows, Linux and macOS). The architecture implements a custom frame-graph compiler, which optimizes each frame to deliver maximum performance. This will allow us to implement new rendering features more easily, and will also make it possible to port existing Unity shaders to CRP.
The most recent change involves the rendering pipeline built on top of CRP to actually render the game. If you think of CRP as an engine whose main purpose is making things run quickly, you can view the rendering pipeline as the car built around the engine, with the main focus on aesthetics and functionality. The following explanation is split into several parts, each going more in-depth than the last. Feel free to read until it gets too complicated; it gets quite complex towards the end, and not reading all the way through is perfectly fine.
Level 1
The rendering pipeline is split into a traditional part and a ray tracing part. The traditional one can run anywhere, and it is not the focus of this update (it employs DenseLOD and other optimizations and is quite functional). The ray tracing part has been the focus lately because it will allow us to render beautiful scenes and create visual effects that are simply not possible traditionally. To make ray tracing efficient, we will employ a cache that stores scene lighting information, and that cache is what this update is about. The prototype implementation is now done, and the next step is to implement the cache fully and layer ray tracing effects on top of it (global illumination, reflection, refraction, ReSTIR, ...).
Level 2
The cache is a neural radiance cache and employs a simple multilayer perceptron with a special input encoding. The input encoding is implemented according to this paper by NVIDIA:
https://arxiv.org/abs/2201.05989. They also provide an official implementation here:
https://github.com/NVlabs/instant-ngp (unfortunately it is written in CUDA rather than HLSL/GLSL, which is why a custom implementation is needed). While it is best known for its ability to reconstruct scenes from simple pictures (we might also use that for things in the future), it is also very good at learning the lighting in a scene, and can therefore function as a radiance cache.
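To give a rough idea of what that input encoding does: each 3D position is looked up in several hash grids of increasing resolution, and the learned feature vectors at the surrounding grid corners are interpolated. Below is a heavily simplified HLSL sketch of a single level of that lookup. The hash primes follow the paper, but the table size, buffer layout and names are illustrative assumptions, not our actual implementation.
```hlsl
#define FEATURES_PER_LEVEL 2
#define HASHMAP_SIZE (1u << 19) // entries per level; illustrative size

StructuredBuffer<float> gFeatureTable; // learned feature vectors for one level

uint HashGridCell(uint3 cell)
{
    // Spatial hash from the instant-ngp paper: coordinates XORed with large primes.
    return (cell.x ^ (cell.y * 2654435761u) ^ (cell.z * 805459861u)) % HASHMAP_SIZE;
}

// Look up one encoding level: trilinearly interpolate the learned features
// of the 8 hash-grid corners surrounding pos (pos normalized to [0,1]^3).
void EncodeLevel(float3 pos, uint resolution, out float feats[FEATURES_PER_LEVEL])
{
    float3 scaled = pos * resolution;
    uint3 cell = (uint3)floor(scaled);
    float3 t = scaled - (float3)cell; // fractional position inside the cell

    for (uint f = 0; f < FEATURES_PER_LEVEL; ++f)
        feats[f] = 0.0;

    [unroll]
    for (uint corner = 0; corner < 8; ++corner)
    {
        uint3 offset = uint3(corner & 1, (corner >> 1) & 1, (corner >> 2) & 1);
        uint entry = HashGridCell(cell + offset);
        float weight = lerp(1.0 - t.x, t.x, (float)offset.x)
                     * lerp(1.0 - t.y, t.y, (float)offset.y)
                     * lerp(1.0 - t.z, t.z, (float)offset.z);
        for (uint f = 0; f < FEATURES_PER_LEVEL; ++f)
            feats[f] += weight * gFeatureTable[entry * FEATURES_PER_LEVEL + f];
    }
}
```
The full encoding runs this for a number of levels (16 in the paper's default configuration) and concatenates the results before feeding them into the MLP.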
Level 3
Tracing rays is expensive! There is only a limited budget of rays that can be traced per frame, and it gets even tighter on previous-generation GPUs. To fully ray trace a scene at 1080p with a single light and only shadows (no reflections, no GI, nothing), we already need 1920 × 1080 × 2 ≈ 4 million rays per frame. With 10 lights, that is already ~40 million rays per frame. You can probably see how this gets very expensive. Adding further effects only increases the load, and 40 MegaRays/frame is already really close to the limit. Typically, games employ a number of technologies to cope: sampling (which introduces noise), denoising (to remove that noise again), upscaling, and temporal light accumulation. Global illumination can be solved by technologies like DDGI (there are countless others), but they all have various limitations compared to the neural radiance cache.
Specifically, to make CRP run at good frame rates, we need to trace as few rays as possible, which is where the radiance cache becomes relevant. We trace many "shallow" rays and terminate them into the cache (the color value at the pixel is then determined by the cache rather than by yet another traced ray), and we trace a few "deep" rays with many bounces and use them to train the cache. This approach is also further explained in the instant-ngp video.
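To illustrate the "few deep rays" part: one simple way to pick which pixels fire a deep training ray each frame is a jittered stratified selection. This is only a hypothetical sketch, not necessarily how CRP picks its training rays:
```hlsl
// Hypothetical sketch: pick one "deep" training pixel per 8x8 tile per frame,
// with the position inside the tile varying over time. The 1-in-64 rate and
// the hash constants are illustrative, not CRP's actual values.
bool IsTrainingPixel(uint2 pixel, uint frameIndex)
{
    uint2 tile = pixel / 8;
    uint hash = tile.x * 73856093u ^ tile.y * 19349663u ^ frameIndex * 83492791u;
    uint2 chosen = uint2(hash % 8, (hash / 8) % 8);
    return all(pixel % 8 == chosen);
}
```
At 1080p these illustrative numbers come out to roughly 240 × 135 ≈ 32,000 deep rays per frame, a tiny fraction of the budget discussed above, while every pixel still only pays for its cheap shallow rays.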
Level 4
To make the radiance cache a useful alternative, it needs to run fast. Really, really fast. The current implementation is still suboptimal and could be classified as an "alpha" in terms of optimization, but it already runs quite fast: training the neural network takes 2-3 ms, and sampling a full 1080p picture from the cache takes 5-7 ms. This easily allows for interactive frame rates, and the goal is to reduce the sampling time further. To optimize the billions of math operations required to train and sample the cache, a fully fused MLP was chosen, as described in the instant-ngp paper. The HLSL code was heavily optimized to maximize cache locality and minimize the time spent waiting between math operations.
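For the curious, here is a strongly simplified HLSL sketch of the "fully fused" idea: the whole network is evaluated inside a single compute-shader dispatch, and the hidden activations live entirely in groupshared memory instead of round-tripping through VRAM between layers. The layer width, layer count, batch size, float precision and buffer names are illustrative assumptions, not our actual kernel.
```hlsl
#define LAYER_WIDTH     64  // neurons per layer
#define NUM_LAYERS      4   // hidden layers, all LAYER_WIDTH wide
#define BATCH_PER_GROUP 32  // network inputs processed by one thread group

StructuredBuffer<float>   gWeights; // NUM_LAYERS row-major LAYER_WIDTH^2 matrices
RWStructuredBuffer<float> gAct;     // network inputs, overwritten with outputs

// Ping-pong activation buffers in groupshared memory: the point of the fully
// fused approach is that hidden activations never touch global memory.
groupshared float sAct[2][BATCH_PER_GROUP][LAYER_WIDTH];

[numthreads(LAYER_WIDTH, 1, 1)]
void FusedMLP(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    uint col = gtid.x;

    // Each thread loads one column of the group's batch slice, exactly once.
    for (uint row = 0; row < BATCH_PER_GROUP; ++row)
        sAct[0][row][col] = gAct[(gid.x * BATCH_PER_GROUP + row) * LAYER_WIDTH + col];
    GroupMemoryBarrierWithGroupSync();

    for (uint layer = 0; layer < NUM_LAYERS; ++layer)
    {
        uint src = layer & 1;
        uint dst = src ^ 1;
        // Each thread computes output neuron `col` for every row in the batch.
        for (uint row = 0; row < BATCH_PER_GROUP; ++row)
        {
            float acc = 0.0;
            for (uint k = 0; k < LAYER_WIDTH; ++k)
                acc += gWeights[(layer * LAYER_WIDTH + col) * LAYER_WIDTH + k]
                     * sAct[src][row][k];
            sAct[dst][row][col] = max(acc, 0.0); // ReLU
        }
        GroupMemoryBarrierWithGroupSync();
    }

    // Only the final layer's activations are written back to global memory.
    for (uint row = 0; row < BATCH_PER_GROUP; ++row)
        gAct[(gid.x * BATCH_PER_GROUP + row) * LAYER_WIDTH + col]
            = sAct[NUM_LAYERS & 1][row][col];
}
```
The paper's CUDA implementation additionally uses half precision and careful warp-level tiling of the matrix multiplications; the sketch only shows the structural point, namely that no activation ever crosses global memory between layers.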
There is a lot of additional information that I did not include here for complexity and length reasons. If you still have questions, I am happy to answer them. Additionally, NVIDIA has some great resources on their research.
The picture roughly shows how the AI learns the input image over time. The image is currently only a test to show that the implementation works. In the future, the AI will learn from 3D world coordinates and surface direction vectors instead.