OpenAI to serve up ChatGPT on Cerebras’ AI dinner plates in $10B+ deal

OpenAI says it will deploy 750 megawatts' worth of Nvidia competitor Cerebras' dinner-plate-sized accelerators through 2028 to bolster its inference services.

The deal, which will see Cerebras take on the risk of building and leasing datacenters to serve OpenAI, is valued at more than $10 billion, sources familiar with the matter tell El Reg.

By integrating Cerebras' wafer-scale compute architecture into its inference pipeline, OpenAI can take advantage of the chip's massive SRAM capacity to speed up inference. Each of the chip startup's WSE-3 accelerators measures 46,225 mm² and is equipped with 44 GB of SRAM.

Compared to the HBM found on modern GPUs, SRAM is several orders of magnitude faster. While a single Nvidia Rubin GPU can deliver around 22 TB/s of memory bandwidth, Cerebras' chips achieve nearly 1,000x that at 21 petabytes per second.
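For a back-of-the-envelope sense of that gap, here is the arithmetic on the two figures quoted above (a quick sketch; the bandwidth numbers are simply the ones cited in this article):

```python
# Rough ratio of the two bandwidth figures cited above.
rubin_hbm_bandwidth = 22e12   # ~22 TB/s of HBM bandwidth per Nvidia Rubin GPU
wse3_sram_bandwidth = 21e15   # ~21 PB/s of on-chip SRAM bandwidth per WSE-3

ratio = wse3_sram_bandwidth / rubin_hbm_bandwidth
print(f"SRAM bandwidth advantage: ~{ratio:.0f}x")  # ~955x, i.e. "nearly 1,000x"
```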

All that bandwidth translates into extremely fast inference performance. Running models like OpenAI's gpt-oss 120B, Cerebras' chips can purportedly achieve single-user performance of 3,098 tokens per second, compared to 885 tok/s for competitor Together AI, which uses Nvidia GPUs.

In the age of reasoning models and AI agents, faster inference means models can "think" for longer without compromising on interactivity.

"Integrating Cerebras into our mix of compute solutions is all about making our AI respond much faster. When you ask a hard question, generate code, create an image, or run an AI agent, there is a loop happening behind the scenes: you send a request, the model thinks, and it sends something back," OpenAI explained in a recent blog post. "When AI responds in real time, users do more with it, stay longer, and run higher-value workloads."

However, Cerebras' architecture has some limitations. SRAM isn't particularly space-efficient, which is why, despite their impressive size, the chips only pack about as much memory as a six-year-old Nvidia A100 PCIe card.

Because of this, larger models need to be parallelized across multiple chips, each of which is rated for a prodigious 23 kW of power. Depending on the precision used, the number of chips required can be considerable. At 16-bit precision, which Cerebras has historically preferred for its higher-quality outputs, every billion parameters eats up 2 GB of SRAM. As a result, even a modest model like Llama 3 70B requires at least four of its CS-3 accelerators to run.
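For a rough sense of that sizing math, here is a minimal sketch assuming a weights-only footprint and the 44 GB per-chip SRAM figure above. Real deployments also need room for activations and KV cache, so treat these as lower bounds:

```python
import math

SRAM_PER_CHIP_GB = 44  # published SRAM capacity of a single WSE-3

def min_chips(params_billions: float, bytes_per_param: float) -> int:
    """Minimum wafer-scale chips needed just to hold the model weights."""
    weights_gb = params_billions * bytes_per_param  # 1B params at 2 bytes = 2 GB
    return math.ceil(weights_gb / SRAM_PER_CHIP_GB)

print(min_chips(70, 2.0))  # Llama 3 70B at 16-bit: 140 GB -> 4 chips
print(min_chips(70, 0.5))  # same model at ~4-bit:   35 GB -> 1 chip
```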

It's been nearly two years since Cerebras unveiled a new wafer-scale accelerator, and since then the company's priorities have shifted from training to inference. We suspect the chip biz's next design may dedicate a larger area to SRAM and add support for modern block floating point data types like MXFP4, which should dramatically increase the size of the models that can be served on a single chip.

Having said that, the introduction of a model router with the launch of OpenAI's GPT-5 last summer should help mitigate Cerebras' memory constraints. The approach ensures that the vast majority of requests fielded by ChatGPT are fulfilled by smaller cost-optimized models. Only the most complex queries run on OpenAI's largest and most resource-intensive models.
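To illustrate the idea, and only the idea, here's a toy routing sketch. The heuristics and model names below are made-up placeholders, not OpenAI's actual routing logic:

```python
def route_request(prompt: str, reasoning_requested: bool = False) -> str:
    """Toy router: send most traffic to a small, cheap model and escalate
    only the hard stuff. Thresholds and model names are purely illustrative."""
    looks_hard = reasoning_requested or len(prompt) > 20_000
    return "big-expensive-model" if looks_hard else "small-cheap-model"

print(route_request("What's the capital of France?"))                     # small-cheap-model
print(route_request("Prove this theorem...", reasoning_requested=True))   # big-expensive-model
```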

It's also possible that OpenAI will run only a portion of its inference pipeline on Cerebras' kit. Over the past year, the concept of disaggregated inference has taken off.

In theory, OpenAI could run the workload's compute-heavy prompt processing phase on AMD or Nvidia GPUs and offload the bandwidth-constrained token generation phase to Cerebras' SRAM-packed accelerators. Whether this is actually an option will depend on Cerebras.
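Here's a minimal sketch of how such a split might look, assuming a simple two-tier prefill/decode handoff. The function names and placeholder math are ours, not a description of either company's stack:

```python
# Illustrative only: compute-bound prefill on GPUs, bandwidth-bound decode on
# wafer-scale chips. Real systems ship the KV cache across a fast interconnect.

def prefill_on_gpus(prompt_tokens: list[int]) -> list[int]:
    """Process the full prompt in parallel (compute-heavy); return a stand-in KV cache."""
    return [t * 2 for t in prompt_tokens]  # placeholder for real attention state

def decode_on_wafer_scale(kv_cache: list[int], max_new_tokens: int) -> list[int]:
    """Generate tokens one at a time (memory-bandwidth-heavy) from the handed-off cache."""
    generated = []
    for step in range(max_new_tokens):
        next_token = (sum(kv_cache) + step) % 50_000  # placeholder for a real forward pass
        generated.append(next_token)
        kv_cache = kv_cache + [next_token]            # cache grows as tokens are emitted
    return generated

cache = prefill_on_gpus([101, 202, 303])
print(decode_on_wafer_scale(cache, max_new_tokens=5))
```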

"This is a Cloud service agreement. We build out datacenters with our equipment for OpenAI to power their models with the fastest inference," a company spokesperson told El Reg when asked about the possibility of using its CS-3s in a disaggregated compute architecture.

This doesn't mean it won't happen, but it would be on Cerebras to deploy the GPU systems required to support such a configuration in its datacenters alongside its wafer-scale accelerators. ®

Source: The Register
