Inference Mode

Next Event

On May 3rd, we will meet to discuss the DFlash paper by Z-Lab.


Sign up to attend here.


Can't make it? Sign up for our mailing list to hear about future events.

What is Inference Mode?

Inference Mode is a reading group for performance-focused engineers building inference-heavy systems.

"Inference mode" denotes a state of mind: ruthless optimization of kernels, tireless removal of host-side bottlenecks, unhinged hacks to skip computations or compress tensors, and making high-performance accelerators go brrrt.

All to deliver the outputs of large neural networks at scale and under SLO.

It is named by analogy to CUDA MODE and in reference to `torch.inference_mode`, the PyTorch decorator and context manager that disables autograd tracking to speed up inference.
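For the curious, here is a minimal sketch of that decorator in use (the model and input shapes are arbitrary placeholders):

```python
import torch

model = torch.nn.Linear(4, 2)

# torch.inference_mode() disables autograd tracking entirely, so no
# computation graph is built during the forward pass. This makes it
# slightly cheaper than torch.no_grad() for pure inference workloads.
@torch.inference_mode()
def predict(x):
    return model(x)

out = predict(torch.randn(1, 4))
print(out.requires_grad)  # False: no gradient graph was recorded
```

Tensors produced inside `inference_mode` are marked as inference tensors and cannot be used later in autograd, which is the trade-off for the speedup.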