AI inference is evolving: tasks like software development and video generation now demand multi-step reasoning and long-horizon context. NVIDIA’s SMART framework guides inference optimization across platforms such as Blackwell and the new Rubin CPX GPU, which is purpose-built for long-context workloads with high efficiency and ROI.

Disaggregated inference addresses this complexity by processing the context (prefill) and generation (decode) phases independently, so each phase can be matched to the compute and memory resources it actually needs. NVIDIA Dynamo orchestrates the disaggregated pipeline, including KV cache transfers between phases, improving throughput and reducing latency; it played a key role in the recent MLPerf Inference performance records set with GB200 NVL72.
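To make the split concrete, here is a minimal Python sketch of the idea: one worker handles the compute-heavy context (prefill) phase and hands the resulting KV cache to a second worker for token-by-token generation. All names, functions, and data structures here are illustrative assumptions for the sake of the sketch, not the NVIDIA Dynamo API.

```python
# Conceptual sketch of disaggregated inference (hypothetical names, not Dynamo's API).
# The compute-bound prefill phase and the memory-bound decode phase run as
# separate workers, with the KV cache handed off between them.
from dataclasses import dataclass

@dataclass
class KVCache:
    """Per-request key/value attention states produced by the prefill worker."""
    keys: list    # placeholder for layer-wise key tensors
    values: list  # placeholder for layer-wise value tensors

def prefill_worker(prompt_tokens: list[int]) -> tuple[int, KVCache]:
    """Context phase: process the full prompt in one compute-heavy pass,
    returning the first sampled token plus the KV cache for the prompt."""
    kv = KVCache(keys=[...], values=[...])  # stand-in for real attention states
    first_token = 0                          # stand-in for the model's first token
    return first_token, kv

def decode_worker(first_token: int, kv: KVCache, max_new_tokens: int) -> list[int]:
    """Generation phase: extend the sequence one token at a time,
    reusing the transferred KV cache instead of recomputing the prompt."""
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):
        next_token = tokens[-1] + 1  # stand-in for a real decode step
        tokens.append(next_token)
        # a real system also appends this step's K/V entries to the cache
    return tokens

# Orchestration: in a disaggregated deployment the two calls below run on
# different GPUs (or racks), and the KV cache moves over a fast interconnect.
first, kv_cache = prefill_worker(prompt_tokens=[101, 2023, 102])
output = decode_worker(first, kv_cache, max_new_tokens=8)
```

Because the two phases scale differently, separating them lets an operator provision compute-dense hardware for prefill and memory-bandwidth-rich hardware for decode, rather than sizing one GPU pool for both.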

NVIDIA's new Rubin CPX GPU targets high-throughput, long-context inference, improving efficiency and ROI for workloads such as software development and video generation. Rubin CPX works alongside Vera CPUs and Rubin GPUs as a complete processing solution, and together they deliver 8 exaFLOPs of NVFP4 compute in a single rack for large-scale generative AI workloads.

The Vera Rubin NVL144 CPX rack, which combines Rubin CPX GPUs, Rubin GPUs, and Vera CPUs, offers a high-performance disaggregated serving solution for million-token-context AI inference workloads. NVIDIA projects up to 50x ROI and as much as $5B in revenue potential, positioning the platform to redefine inference economics and unlock advanced capabilities for developers and creators worldwide.

Read more at NVIDIA: NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads