AWS Made Random Graph Theory the Default for New Data Centers

Amazon Web Services has made random graph theory the default network design inside its data centers. The architecture, called Resilient Network Graphs or RNG, was live in production by the end of 2024 and had become the default for most new AWS builds globally by April 2026, according to Amazon’s own paper and the team that built it at Amazon Science. RNG replaces the fat-tree topology that has organized hyperscale data centers for decades, and it does so on the same commodity routers, fiber, and optics AWS already buys.

The numbers Amazon reports are concrete. RNG uses 69% fewer routers, lifts throughput by up to 33%, and is projected to cut network equipment electricity consumption by 40%. Customer workloads run on the new fabric without changes. For a company whose outages take a large share of the internet down with them, the design is also a resilience play: in a random network, a 1% loss of routers costs roughly 1% of capacity, and the degradation stays proportional and predictable.

The Network That Carries 28% of the Cloud

AWS held 28% of the global cloud infrastructure market in the first quarter of 2026, ahead of Microsoft at 21% and Google Cloud at 14%, according to Q1 2026 cloud market share data. Any network change inside that footprint reaches a large share of the public internet in short order. The same economies of scale that make AWS attractive also mean an outage in any one region is felt across the web, and that pressure has pushed AWS toward architectures that fail more gracefully.

RNG is the new architecture. It connects commodity routers to each other directly, in a quasi-random pattern, instead of stacking them in the strict layers of a fat tree. The same routers, fiber, and optics stay in place, and what changes is how they are cabled and how packets are routed. AWS calls the routing algorithm Spraypoint and the passive optical device that holds the cabling together the ShuffleBox. The combined design went live first in Ireland at the end of 2024, and quasi-random wiring has since rolled out to most new AWS data centers globally.

Why Fat Tree Stopped Scaling

A fat tree is a layered hierarchy. Bottom-layer routers, the ones that connect to server racks, link upward through aggregation and spine routers, and a packet climbs the tree to find its destination before traveling back down. The structure is easy to implement and easy to reason about, which is why every hyperscaler adopted it. The cost is rigid capacity. Traffic between two racks is forced onto a small slice of links, and the rest of the fabric sits idle, even when the workload would happily use it.
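To make the layering concrete, here is a minimal sketch of the classic three-tier k-ary fat tree from the networking literature. It is a textbook model, not AWS's actual design, and the names `fat_tree_sizes` and `switches_on_path` are hypothetical helpers introduced only for illustration. It counts the switches in each layer and shows how many switches a packet crosses as it climbs the tree and comes back down.

```python
def fat_tree_sizes(k: int) -> dict:
    """Switch and host counts for a textbook three-tier k-ary fat tree.

    ASSUMPTION: every switch has k ports and k is even. This is the
    standard academic construction, not a description of any AWS fabric.
    """
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    sizes = {
        "edge": k * half,          # k pods, k/2 rack-facing switches each
        "aggregation": k * half,   # k pods, k/2 aggregation switches each
        "core": half * half,       # (k/2)^2 spine (core) switches
        "hosts": k * half * half,  # each edge switch serves k/2 hosts
    }
    sizes["total_switches"] = sizes["edge"] + sizes["aggregation"] + sizes["core"]
    return sizes


def switches_on_path(src: tuple, dst: tuple) -> int:
    """Switches crossed between two hosts, each given as (pod, edge_switch)."""
    if src == dst:
        return 1  # same edge switch, no climb needed
    if src[0] == dst[0]:
        return 3  # edge -> aggregation -> edge, inside one pod
    return 5      # edge -> aggregation -> core -> aggregation -> edge


if __name__ == "__main__":
    print(fat_tree_sizes(4))
    print(switches_on_path((0, 0), (3, 1)))
```

With 4-port switches, this model needs 20 switches to serve 16 hosts, and any traffic between pods must cross one of only 4 core switches.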

The other cost is fragility. A spine router sits at the top of the tree and carries traffic for many endpoint pairs. A single hardware fault can halve the available capacity for most of the fabric, and AWS builds redundancy on top of this design at extra cost. Random graph theory has long been the proposed escape from both the rigidity and the single points of failure. The math was settled. The problem was that no one had been able to build it at scale.

  • The fat tree has been the workhorse of hyperscale networks for decades, per the AWS research paper.
  • A single spine router failure in a fat tree can halve available capacity for many endpoints.
  • 1990s mathematics showed random topologies outperformed trees on resilience, in theory.

The shift to RNG is, in effect, AWS cashing the check those proofs wrote. The incumbent is a fat tree with costly redundancy layered on top. The replacement is a flat fabric with no special routers to fail.

A 1990s Theory Finds a Real Data Center

In the early 1990s, mathematicians showed that the optimal network for routing has a random topology, in which each router connects to a few others chosen at random. The result is counterintuitive, but such a network ends up with many distinct paths between every pair of routers, and no router is more important than any other. AWS researchers cite this work as the foundation for RNG.
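The idea can be sketched in a few lines of Python. The example below builds a random graph in which every router has the same number of links, then measures how many of the surviving routers' links remain after a small set of routers fails. It uses degree-preserving edge swaps, a standard randomization technique chosen here for simplicity. It is not the construction from the AWS paper, and the function names and parameters are illustrative assumptions.

```python
import random


def random_regular_graph(n: int, d: int, swaps: int, seed: int) -> dict:
    """Return an adjacency dict where every router has exactly d neighbors.

    ASSUMPTION: this starts from a ring-like layout and randomizes it with
    double-edge swaps; it is a stand-in for a random topology, not RNG itself.
    """
    if d % 2 or not 2 <= d < n:
        raise ValueError("d must be even and satisfy 2 <= d < n")
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, d // 2 + 1):
            u = (v + step) % n
            adj[v].add(u)
            adj[u].add(v)
    rng = random.Random(seed)
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    for _ in range(swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, e) = edges[i], edges[j]
        # Replace links a-b and c-e with a-c and b-e, unless that would
        # create a self-loop or duplicate an existing link.
        if len({a, b, c, e}) < 4 or c in adj[a] or e in adj[b]:
            continue
        adj[a].discard(b)
        adj[b].discard(a)
        adj[c].discard(e)
        adj[e].discard(c)
        adj[a].add(c)
        adj[c].add(a)
        adj[b].add(e)
        adj[e].add(b)
        edges[i], edges[j] = (a, c), (b, e)
    return adj


def links_retained_after_failures(adj: dict, failed) -> float:
    """Fraction of the surviving routers' links that still reach a live router."""
    failed = set(failed)
    alive = [v for v in adj if v not in failed]
    total = sum(len(adj[v]) for v in alive)
    kept = sum(1 for v in alive for u in adj[v] if u not in failed)
    return kept / total


if __name__ == "__main__":
    fabric = random_regular_graph(n=200, d=8, swaps=5000, seed=1)
    # Fail 2 of 200 routers (1%) and report what the survivors keep.
    print(links_retained_after_failures(fabric, failed=[0, 1]))
```

Because every router has the same degree, losing k of n routers can strip at most a fraction k/(n-k) of the survivors' links. With 2 of 200 routers failed, that bound is about 1%, which is the proportional degradation the article describes. Link counts are only a rough proxy for capacity, so treat this as intuition rather than a throughput model.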

For decades, the theory could not leave the chalkboard. Random networks need a routing protocol that decides how packets reach their destinations, and computing the right paths in a fully random topology takes far more memory than commodity switches carry. AWS puts the gap at 20 to 80 times more memory than commodity hardware ships with. Cabling was the bigger problem. A real data center would need millions of fiber connections, and you cannot manually wire a random graph at scale.

The break came from a Slack message in 2023. Ratul Mahajan, an Amazon Scholar and University of Washington professor, posted: “Looking for someone with expertise in graph theory and routing.” Seshadhri Comandur, an Amazon Scholar and University of California, Santa Cruz mathematician who worked on networks in the abstract, replied: “Yeah, I know something about that.”

“It was typical for academia. Everybody’s excited, but then the real world hits.”

That was Giacomo Bernardi, the third lead and an AWS principal applied scientist, describing the gap between mathematical theory and the inside of a hyperscale data center. Bernardi, Mahajan, and Comandur went on to lead the team that closed that gap, working alongside optical engineers, data center designers, and the AWS networking organization.

Spraypoint Routes Traffic, ShuffleBox Cables It

Spraypoint splits routing into two phases. The source router sprays its traffic randomly to all of its neighbors, and each packet then travels by shortest path to a designated waypoint that feeds it to its destination. Waypoints fan traffic out and prevent packets from piling up on a single link near the destination. Because the spraying happens on commodity routers, the protocol runs on hardware AWS already owns, with no specialized CPUs or extra memory. AWS reports that Spraypoint provides nearly twice as many independent paths between routers as standard shortest-path routing techniques.
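The two-phase idea can be sketched as follows. This is a toy model of the description above, not AWS's Spraypoint implementation: the source picks one neighbor at random per packet, the packet then follows a shortest path to a waypoint, and the waypoint hands it to the destination. How waypoints are actually designated is not stated here, so the sketch assumes, purely for illustration, that the waypoint is a randomly chosen neighbor of the destination. All function names are hypothetical.

```python
import random
from collections import deque


def circulant(n: int, steps) -> dict:
    """Small fixed test topology: router v links to v±s (mod n) for each s."""
    return {v: {(v + s) % n for s in steps} | {(v - s) % n for s in steps}
            for v in range(n)}


def shortest_path(adj: dict, start: int, goal: int) -> list:
    """Breadth-first search; returns one shortest hop-by-hop path."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if goal not in parent:
        raise ValueError("goal unreachable")
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def spray_route(adj: dict, src: int, dst: int, rng: random.Random) -> list:
    """Route one packet: spray to a neighbor, go to a waypoint, then deliver."""
    if src == dst:
        return [src]
    first_hop = rng.choice(sorted(adj[src]))  # phase 1: random spray
    # ASSUMPTION: the waypoint is a random neighbor of the destination.
    waypoint = rng.choice(sorted(adj[dst]))
    path = [src] + shortest_path(adj, first_hop, waypoint) + [dst]
    # If the packet happens to reach dst early, stop there.
    return path[: path.index(dst) + 1]


if __name__ == "__main__":
    fabric = circulant(8, (1, 3))
    rng = random.Random(0)
    routes = {tuple(spray_route(fabric, 0, 4, rng)) for _ in range(200)}
    print(len(routes), "distinct routes used between routers 0 and 4")
```

Plain shortest-path routing would use only the shortest routes between two routers, while the spray step spreads packets over every link leaving the source. The sketch does not remove detours back through the source, which a production protocol would need to handle.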

The cabling side runs on a piece of hardware AWS calls a ShuffleBox, a sealed enclosure with no power supply that takes in fiber on one side and shuffles the connections internally according to a fixed pattern. When ShuffleBoxes are then connected to each other randomly, the result is a quasi-random graph that looks like spaghetti but does not need to be wired like spaghetti. The ShuffleBox adds no latency and no failure surface, and a new rack plugs into a local port without rewiring anywhere else in the fabric.
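A rough way to picture the cabling idea is to model a single box as a fixed, passive pairing of its ports. The sketch below does that under stated assumptions: the internal pattern is a perfect matching fixed at manufacture, routers plug their uplinks into consecutive ports, and the router-to-router links fall out of the pattern. The real ShuffleBox internals are not described here, the sketch does not model boxes cabled to each other, and `make_shuffle_pattern` and `wire_routers` are hypothetical names.

```python
import random
from collections import defaultdict


def make_shuffle_pattern(n_ports: int, seed: int) -> dict:
    """Fixed internal pattern: each port is passively joined to one other port.

    ASSUMPTION: the pattern is a perfect matching chosen once (the seed stands
    in for the manufactured layout) and never changes afterwards.
    """
    if n_ports % 2:
        raise ValueError("n_ports must be even")
    ports = list(range(n_ports))
    random.Random(seed).shuffle(ports)
    pattern = {}
    for a, b in zip(ports[0::2], ports[1::2]):
        pattern[a] = b
        pattern[b] = a
    return pattern


def wire_routers(n_routers: int, uplinks: int, pattern: dict) -> dict:
    """Plug each router's uplinks into consecutive ports; return link counts.

    Keys are (router_a, router_b) with router_a < router_b, and values are the
    number of parallel fibers. Pairs with an unplugged end, or with both ends
    on the same router, carry no link.
    """
    needed = n_routers * uplinks
    if needed > len(pattern):
        raise ValueError("not enough ports on the box")
    owner = {port: port // uplinks for port in range(needed)}
    links = defaultdict(int)
    for a, b in pattern.items():
        if a < b and a in owner and b in owner and owner[a] != owner[b]:
            key = (min(owner[a], owner[b]), max(owner[a], owner[b]))
            links[key] += 1
    return dict(links)


if __name__ == "__main__":
    box = make_shuffle_pattern(16, seed=1)
    print(wire_routers(3, 4, box))  # three routers plugged in
    print(wire_routers(4, 4, box))  # a fourth router added later
```

In this model the technician's work stays orderly, with consecutive ports per router, while the scrambling lives inside the box. Adding a fourth router only creates new links, and every link among the first three routers stays exactly as it was.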

| Attribute | Fat Tree | AWS RNG |
| --- | --- | --- |
| Router count | Baseline | 69% fewer |
| Throughput | Baseline | Up to 33% higher |
| Network equipment power | Baseline | 40% lower |
| Cost vs equivalent oversubscription | Baseline | 9 to 45% cheaper |
| Router failure impact | Spine failure halves capacity for many endpoints | 1% router loss = roughly 1% capacity loss |
| Commodity router support | Yes | Yes |

The two pieces together turn the abstract idea into something AWS can install on commodity hardware, with no specialized routers or new fiber plant required. The validation covered transport and application layer benchmarks, and the fabric performed on par with fat trees across multipath-transport workloads and latency-sensitive storage. In its technical paper on RNG, posted to arXiv, Amazon describes the work as the first ever scalable flat-network datacenter, which makes the rollout, by the company’s own account, the first deployment of a flat, random-graph data center network at hyperscale.

The Numbers AWS Is Reporting

Amazon’s figures, drawn from its arXiv paper and Amazon Science write-ups, line up consistently. RNG uses 69% fewer routers than the fat tree it replaces, lifts throughput by up to 33%, and cuts network equipment electricity consumption by 40%. The 69% router reduction translates directly into less power, less cooling, and less operational overhead at every site, and Amazon frames the rollout as saving billions of dollars in hardware across the regions where it operates.

The cost figures are wider than the headline. The arXiv paper puts RNG at 9 to 45% cheaper than fat trees with equivalent oversubscription, depending on the design ratio and independent of network size. That range matches the 45% ceiling cited in the Amazon Science post. The company validated the design with 530 processor-years of simulation, the rough equivalent of running a single CPU for half a millennium, executed on Amazon EC2. End-to-end benchmarks in production fabrics matched fat tree performance for multipath-transport workloads and latency-sensitive storage, with no customer workload changes required.

From Dublin to Default by April 2026

The first quasi-random RNG fabric went live at the end of 2024, carrying real production traffic. AWS used that deployment to validate the topology against the mathematical predictions and to identify operational refinements. Two additional deployments followed, and AWS applied the same refinements there.

  • 2023: A Slack message starts the project that becomes RNG.
  • End of 2024: First RNG fabric goes live near Dublin, Ireland, on production traffic.
  • 2025 to early 2026: Two more deployments carry the refinements from Dublin.
  • April 2026: RNG becomes the default architecture for most new AWS data centers globally.

By April 2026, quasi-random wiring had become the default for most new AWS data centers globally, per the inside story of how AWS built RNG. The deployment is invisible to customers, and the network operates transparently beneath existing applications. The sequencing of the rollout gave AWS a way to test the operational playbook, including how technicians plug new racks into ShuffleBox ports and how the routing protocol behaves under real workload shifts, before committing the design to global default.

What the Industry Does With It

Google Cloud and Microsoft Azure, the next two largest cloud providers, have not publicly committed to RNG-style fabrics. AWS has now shown that flat topologies can be deployed at hyperscale on commodity hardware, and the academic case for them has been on the table for over a decade. The arXiv paper explicitly targets other hyperscalers, framing RNG as an approach any of them could adopt once the routing and cabling pieces are solved.

For competitors, the question is not whether flat topologies work, but whether the ShuffleBox and Spraypoint pieces are something they can license or copy, or a moat Amazon has built for itself.

The cost savings scale with each new data center AWS builds, and the 69% router reduction is the largest single lever in the math. Routers carry both capital and operating cost, and less hardware also means less electricity and less cooling load at each site. AWS has framed the rollout as something that will lower CO2 emissions across the growing number of grids where it operates, on top of the billions of dollars in hardware savings it has described. For competitors, the open question is how much of that advantage is portable through the academic paper alone, and how much sits inside the ShuffleBox hardware and the Spraypoint protocol that AWS has chosen to disclose.

What remains open is whether the design scales to the most demanding AI training fabrics, which often rely on rail-optimized fat trees and aggregation-layer capacity islands. The arXiv paper notes that flat, random topologies do not have aggregation islands, which can matter for specialized workloads, and the authors flag large-scale AI training as future work. For general multi-tenant data center traffic, RNG is now the AWS default.

Frequently Asked Questions

What is AWS RNG?

AWS RNG, or Resilient Network Graphs, is a quasi-random network topology that connects commodity routers directly to each other instead of layering them in a fat tree. The design pairs a routing protocol called Spraypoint with a passive optical device called a ShuffleBox, and AWS deployed it first near Dublin at the end of 2024.

What does AWS RNG replace?

It replaces the fat tree network topology that has organized hyperscale data centers for decades. A fat tree stacks routers in aggregation and spine layers and forces traffic through those layers, while RNG connects routers to each other directly in a quasi-random pattern that exposes more available paths at any moment.

How much does AWS RNG save compared with a fat tree?

According to AWS, RNG uses 69% fewer routers, lifts throughput by up to 33%, and reduces network equipment electricity consumption by 40%. The arXiv paper puts RNG at 9 to 45% cheaper than fat trees with equivalent oversubscription ratios, depending on the design parameters.

When did AWS deploy RNG?

The first quasi-random RNG network went live near Dublin at the end of 2024. Two additional deployments followed, and by April 2026, RNG had become the default architecture for most new AWS data centers globally.

Will other cloud providers adopt random-graph topologies?

Neither Google Cloud nor Microsoft Azure has publicly committed to RNG-style fabrics. AWS’s own paper frames the design as a general approach any hyperscaler could adopt once the routing and cabling pieces are solved, and the academic foundation has been public for over a decade.
