The partnership between NVIDIA and OpenAI traces its roots back to 2016, when NVIDIA’s CEO, Jensen Huang, famously hand-delivered the first NVIDIA DGX-1 AI supercomputer to OpenAI’s headquarters in San Francisco. This historic gesture marked the start of a deep collaboration that continues to redefine the possibilities in the world of artificial intelligence, providing the technologies and know-how required for massive-scale AI breakthroughs.
Launching the Next Generation: gpt-oss-20b and gpt-oss-120b
Building on that legacy, NVIDIA and OpenAI have now unveiled two cutting-edge open-weight large language models, gpt-oss-20b and gpt-oss-120b. These models set a new benchmark for open-weight reasoning models, delivering high performance on domain-specific reasoning, agentic workflows, and tool-calling tasks.
gpt-oss-120b rivals commercial models on core reasoning benchmarks, yet can run efficiently on a single 80 GB NVIDIA GPU, powering use cases that require low latency and high security.
gpt-oss-20b is a smaller but still capable variant, runnable even on high-end consumer devices with at least 16 GB of VRAM, making it ideal for rapid on-device iteration and local deployment.
Both models use a mixture-of-experts (MoE) architecture, support chain-of-thought reasoning, and are optimized for efficient deployment across a range of hardware.
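The MoE idea can be sketched in a few lines: a small router scores every expert for each token, and only the top-k experts actually run, so most of the model's parameters sit idle on any given token. The following is a toy illustration with made-up numbers, not the actual gpt-oss routing code:

```python
# Toy sketch of mixture-of-experts (MoE) top-k routing.
# Expert count, logits, and k are illustrative, not gpt-oss internals.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the top-k experts by router probability and renormalize
    their gate weights so the selected gates sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    selected = sum(probs[i] for i in top)
    return {i: probs[i] / selected for i in top}

# One token, four experts: only two experts are activated.
gates = route([1.2, -0.3, 2.5, 0.1], k=2)
print(gates)  # expert 2 and expert 0 get the token, gates sum to 1
```

Because only k experts run per token, a model can carry far more total parameters than it spends compute on, which is how the gpt-oss models combine large capacity with manageable inference cost.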
Built for Speed: Performance on Blackwell Architecture
The gpt-oss models have been meticulously tuned to exploit NVIDIA’s advanced Blackwell architecture. With features like second-generation Transformer Engines, FP4 Tensor Cores, and the blazing-fast NVLink-5, the Blackwell-powered GB200 NVL72 system can deliver up to 1.5 million tokens per second, enough to serve roughly 50,000 concurrent users with large-scale inference. Blackwell’s 4-bit FP4 precision also halves the memory needed per weight relative to FP8, roughly doubling throughput and the effective model size that fits in a given memory budget.
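The memory arithmetic behind the FP4 claim is easy to check: at 4 bits (half a byte) per weight, a roughly 120-billion-parameter model needs on the order of 56 GiB for its weights, which fits on a single 80 GB GPU with headroom left for activations and KV cache. The parameter count below is approximate:

```python
# Back-of-the-envelope check: weight memory for a ~120B-parameter model
# at different precisions. Shows why FP4 fits on a single 80 GB GPU.
PARAMS = 120e9  # approximate parameter count
GIB = 1024**3

for name, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    gib = PARAMS * bytes_per_param / GIB
    print(f"{name}: ~{gib:.0f} GiB of weights")
# FP4 comes out to ~56 GiB, comfortably under 80 GB;
# FP8 (~112 GiB) and BF16 (~224 GiB) would not fit on one GPU.
```

This is weight memory only; real deployments also budget for the KV cache, which grows with batch size and context length.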
Developer Ecosystem and Framework Optimizations
NVIDIA has ensured the gpt-oss models are accessible and performant across a wide software ecosystem. Developers can rely on a vast library of optimized kernels and frameworks, including:
TensorRT-LLM: Empowers efficient inference with precision tuning, kernel fusion, prefill/decoding acceleration, and MoE routing, drastically reducing compute cost and latency.
FlashInfer: Delivers advanced MoE routing and attention optimizations for serving large models in production.
vLLM, Hugging Face Transformers, Ollama, and more: Broad compatibility ensures that teams can deploy using their preferred open-source tools, both in the data center and on local machines.
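As one concrete path through this ecosystem, vLLM (like several of the tools above) serves models behind an OpenAI-compatible HTTP endpoint, so a deployed gpt-oss model can be queried with nothing but the Python standard library. The server address, port, and model id below are assumptions for illustration, not fixed values:

```python
# Sketch: querying a gpt-oss model behind an OpenAI-compatible endpoint,
# e.g. one started with `vllm serve openai/gpt-oss-20b` (assumed setup).
import json
from urllib import request

def build_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build an OpenAI-compatible /v1/chat/completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Usage against a running local server (hypothetical address and model id):
# req = build_chat_request("http://localhost:8000", "openai/gpt-oss-20b",
#                          "Summarize mixture-of-experts in one sentence.")
# with request.urlopen(req) as resp:
#     body = json.loads(resp.read())
#     print(body["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI API shape, the same request works unchanged against other compatible servers, which is part of what makes the open tooling around these models interchangeable.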
Seamless Deployment Options: Cloud, Data Center, and Edge
NVIDIA Launchable: Developers can try the models instantly in GPU-accelerated, pre-configured cloud environments with a single click, eliminating infrastructure setup headaches.
NVIDIA Dynamo: This open-source inference framework is purpose-built for large-scale, multi-node LLM serving, offering advanced routing, autoscaling, and disaggregated serving for ultra-high throughput and efficiency.
On-premise and Local PCs: gpt-oss-20b is deployable on virtually any modern GeForce RTX or RTX PRO-powered workstation, enabling fast, private, local AI workflows—even on consumer hardware.
Simplifying Enterprise AI with NVIDIA NIM
For enterprise developers, NVIDIA NIM provides secure, flexible microservices for deploying the gpt-oss models as managed APIs. NIM supports integration with any GPU-accelerated infrastructure, putting data privacy and enterprise-class security at the forefront.
Why This Matters: Accelerating the Open AI Revolution
This joint release marks a significant step toward democratizing advanced AI. The gpt-oss-20b and gpt-oss-120b models are not just high performers; they offer unmatched accessibility through a rich suite of developer tools, flexible deployment options, and permissive open licensing (Apache 2.0). As organizations worldwide seek both cutting-edge AI capabilities and open, customizable solutions, these models are set to accelerate innovation across industries.
Developers can explore, deploy, and integrate these open-weight models today using the NVIDIA API Catalog, Launchable one-click environments, and comprehensive documentation from NVIDIA and OpenAI, setting the stage for the next wave of breakthrough AI applications.