Raja Koduri and the Inference Problem Silicon Architects Can No Longer Ignore
AI inference demand is not creeping up gradually. It is accelerating faster than most traditional design assumptions were built to handle. The architecture decisions being made right now (which process nodes to use, where memory is placed, how chips are packaged) will determine performance, cost, and energy consumption for years to come.
That is the challenge Raja Koduri addressed in his keynote, Chiplet Quilting for the Age of Inference, delivered at an event organized around a single theme: AI Everywhere.
The talk was not a product pitch. It was a working argument, built from first principles, about where silicon design goes from here.
Why the Numbers Behind AI Everywhere Should Change How You Design
The Scale Is Already Here
Raja Koduri opened by grounding the AI Everywhere theme in concrete data. Token usage currently sits in the quadrillions per month. Projecting forward at conservative growth rates, inference demand is expected to reach roughly 10¹⁸ tokens per month by 2030. Even under best-case efficiency assumptions, meeting that demand points to hundreds of gigawatts of infrastructure.
That figure alone reframes the design problem. This is not a future consideration. The demand signal is already present. What remains open is how architects respond to it.
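The step from tokens per month to gigawatts can be checked with back-of-envelope arithmetic. The sketch below is only an illustration: the 10¹⁸ tokens per month figure comes from the keynote, while the joules-per-token values are hypothetical system-level placeholders (not figures from the talk) chosen to show how sensitive the result is to that one parameter.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an approximation


def required_power_gw(tokens_per_month: float, joules_per_token: float) -> float:
    """Average power in gigawatts needed to serve a monthly token volume."""
    tokens_per_second = tokens_per_month / SECONDS_PER_MONTH
    watts = tokens_per_second * joules_per_token  # J/s is watts
    return watts / 1e9


# 1e18 tokens/month is the projection cited in the keynote.
# The energy-per-token values are hypothetical, for illustration only.
for joules_per_token in (0.5, 1.0, 2.0):
    power = required_power_gw(1e18, joules_per_token)
    print(f"{joules_per_token:.1f} J/token -> {power:.0f} GW")
```

Under these assumptions, anything near one joule per token at the system level lands in the hundreds of gigawatts, and every halving of energy per token halves the infrastructure required.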
First Principles Still Decide Who Wins
Raja Koduri structured the core of his argument around the fundamentals that matter most to anyone building silicon today:
- Performance per dollar
- Performance per watt
- Flexibility across future workloads
- Packaging cost
- Energy to compute
- Energy to move data
- Energy to access memory
Physics defines the hard limits. Economics determines what can actually scale. Compute operations now cost femtojoules per bit. Data movement costs far more. Off-chip memory access dominates the energy budget. Distance matters. Where memory sits matters. How the package is assembled matters. These are not abstract concerns. They are the specific variables where efficiency is won or lost.
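A small energy-budget sketch makes the ordering concrete. The per-bit costs below are hypothetical placeholders that only preserve the relationship described in the talk (compute in the femtojoule range, data movement higher, off-chip memory access highest); they are not measured values for any real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyCosts:
    """Energy per bit in picojoules. Values used below are hypothetical."""
    compute_pj_per_bit: float
    on_package_pj_per_bit: float
    off_chip_pj_per_bit: float


def energy_shares(costs: EnergyCosts, compute_bits: float,
                  on_package_bits: float, off_chip_bits: float) -> dict:
    """Return each category's fraction of the total energy spent."""
    energy_pj = {
        "compute": costs.compute_pj_per_bit * compute_bits,
        "on_package_movement": costs.on_package_pj_per_bit * on_package_bits,
        "off_chip_memory": costs.off_chip_pj_per_bit * off_chip_bits,
    }
    total = sum(energy_pj.values())
    return {name: value / total for name, value in energy_pj.items()}


# Hypothetical costs: 10 fJ/bit compute, 0.5 pJ/bit on-package, 5 pJ/bit off-chip.
costs = EnergyCosts(compute_pj_per_bit=0.01,
                    on_package_pj_per_bit=0.5,
                    off_chip_pj_per_bit=5.0)

# Assume the same number of bits flows through each stage.
shares = energy_shares(costs, compute_bits=1e9,
                       on_package_bits=1e9, off_chip_bits=1e9)
for name, share in shares.items():
    print(f"{name:>20}: {share:.1%}")
```

With equal traffic through each stage, nearly all of the energy in this toy budget goes to off-chip access, which is why shortening the distance to memory has more leverage than making the arithmetic itself cheaper.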
Chiplet Quilting: From Optional to Essential
Post-Dennard scaling has forced a set of tradeoffs that are no longer avoidable. Advanced process nodes cost more per square millimeter. Power efficiency gains are flattening. Not every function in a system belongs on the most advanced process available.
Chiplet architectures make those tradeoffs explicit and manageable.
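One way to see the tradeoff is a simple cost-per-good-die estimate. The sketch below uses the textbook Poisson yield model, yield = exp(-area × defect density), which is a common simplification and not something presented in the keynote. Every number in it (cost per square millimeter, defect densities, die areas, packaging overhead) is a hypothetical placeholder.

```python
import math


def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-area_mm2 * defects_per_mm2)


def cost_per_good_die(area_mm2: float, cost_per_mm2: float,
                      defects_per_mm2: float) -> float:
    """Raw silicon cost divided by yield, ignoring test and wafer-edge loss."""
    return area_mm2 * cost_per_mm2 / poisson_yield(area_mm2, defects_per_mm2)


# Hypothetical process parameters, for illustration only.
ADVANCED = {"cost_per_mm2": 0.20, "defects_per_mm2": 0.001}
MATURE = {"cost_per_mm2": 0.05, "defects_per_mm2": 0.0005}
PACKAGING_OVERHEAD = 30.0  # hypothetical extra cost of multi-die assembly

# Option A: one 800 mm^2 die, everything on the advanced node.
monolithic = cost_per_good_die(800, **ADVANCED)

# Option B: two 200 mm^2 compute chiplets on the advanced node,
# plus 400 mm^2 of I/O and other logic on a mature node.
chiplet_system = (2 * cost_per_good_die(200, **ADVANCED)
                  + cost_per_good_die(400, **MATURE)
                  + PACKAGING_OVERHEAD)

print(f"monolithic:     {monolithic:.2f}")
print(f"chiplet system: {chiplet_system:.2f}")
```

Smaller dies yield better, and logic that does not benefit from the leading node stops paying for it. Whether that outweighs the added packaging cost depends entirely on the real inputs, which is the kind of question a modeling tool has to answer.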
Chiplet quilting takes the concept further. Rather than a fixed layout, the system becomes a configurable fabric. Compute, memory, and interconnect elements are treated as modular. Architects can tune for cost, power, bandwidth, and latency based on the specific demands of the workload.
The result is flexibility without losing rigor. It is a design philosophy that matches the actual shape of inference workloads rather than forcing those workloads to conform to a legacy architecture. This, according to Raja Koduri, is the point where chiplets stop being optional.
Putting Theory to Work: Oxmiq Labs and the Reference System
At Oxmiq Labs, the argument does not stop at theory. The team has built tools that allow engineers to model chiplet-based systems using real, measurable parameters, including die size and geometry, memory bandwidth, interconnect bandwidth, power consumption, cost inputs, picojoules per bit, picojoules per operation, and inference workload profiles.
Raja Koduri walked through a reference system built around the NVIDIA DGX B200, a current flagship GPU platform. He then compared it with a hypothetical quilted system, one designed around tighter memory coupling and higher-bandwidth interconnects.
The comparison showed order-of-magnitude gains in throughput and order-of-magnitude reductions in energy per token.
The point was not to make a product claim. It was to show architectural leverage. When memory sits closer to compute and interconnect bottlenecks shrink, the economics of inference shift in a meaningful direction. The tools exist to model exactly how far that shift goes.
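To illustrate the kind of leverage being described, here is a minimal first-order model. It is not Oxmiq Labs' tool and does not reproduce the keynote's numbers. It assumes throughput is limited by whichever is slower, compute or memory bandwidth, and that energy per token is the sum of compute and memory-access energy. The `SystemConfig` and `Workload` types and every parameter value are hypothetical; in particular, the "baseline" figures are not DGX B200 specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemConfig:
    name: str
    peak_ops_per_s: float         # sustained compute rate
    memory_bw_bytes_per_s: float  # bandwidth between memory and compute
    pj_per_op: float              # energy per compute operation
    pj_per_memory_bit: float      # energy per bit fetched from memory


@dataclass(frozen=True)
class Workload:
    ops_per_token: float
    memory_bytes_per_token: float


def tokens_per_second(system: SystemConfig, workload: Workload) -> float:
    """Throughput is capped by the tighter of the compute and memory limits."""
    compute_limit = system.peak_ops_per_s / workload.ops_per_token
    memory_limit = system.memory_bw_bytes_per_s / workload.memory_bytes_per_token
    return min(compute_limit, memory_limit)


def joules_per_token(system: SystemConfig, workload: Workload) -> float:
    """Compute energy plus memory-access energy, converted from pJ to J."""
    compute_pj = system.pj_per_op * workload.ops_per_token
    memory_pj = system.pj_per_memory_bit * workload.memory_bytes_per_token * 8
    return (compute_pj + memory_pj) * 1e-12


# Hypothetical workload: weights are re-read from memory for every token.
workload = Workload(ops_per_token=2e10, memory_bytes_per_token=1e10)

# Hypothetical systems that differ only in memory bandwidth and access energy.
baseline = SystemConfig("baseline", 1e15, 4e12, pj_per_op=0.5, pj_per_memory_bit=5.0)
quilted = SystemConfig("quilted", 1e15, 4e13, pj_per_op=0.5, pj_per_memory_bit=0.5)

for system in (baseline, quilted):
    print(f"{system.name:>8}: {tokens_per_second(system, workload):.0f} tokens/s, "
          f"{joules_per_token(system, workload):.3f} J/token")
```

Because this toy workload is memory-bound, raising bandwidth and lowering access energy moves both throughput and energy per token while peak compute stays the same. The size of the gain here is simply a product of the chosen inputs.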
Three Ideas Worth Taking Away
Raja Koduri closed with a short, direct summary. First, the age of inference has arrived. The numbers confirm it.

Second, the details at the physics level decide who wins. Femtojoules, picojoules, memory placement, packaging geometry: these are not engineering footnotes. They are the variables where competitive outcomes are decided.
Third, chiplets are a practical path forward, and building them well is a genuinely interesting problem. Teams designing silicon for inference who want to test Oxmiq Labs’ configuration tools are invited to reach out directly.
About Raja Koduri
Raja Koduri is a chip architect and the founder of Oxmiq Labs, focused on chiplet-based system design for AI inference workloads. His keynote, Chiplet Quilting for the Age of Inference, was delivered at an event themed AI Everywhere, where he presented a framework for evaluating silicon architecture decisions based on physics fundamentals and economic scalability. You can view Raja’s keynote presentation here.
