ARM Now Runs Half the Cloud: Graviton 4, Ampere Altra Max, and the Numbers Behind x86's Retreat

The Shift Is No Longer Theoretical
For most of the past decade, ARM in the server room was a promise — always two years away from being production-ready. That time has passed. AWS reports that its Graviton-based instances now power a substantial and growing share of its compute fleet. Ampere's Altra Max chips are running production workloads at Oracle Cloud, Microsoft Azure, and Google Cloud. NVIDIA's Grace CPU is shipping in Grace Hopper Superchips deployed in AI clusters worldwide. The question is no longer whether ARM can handle server workloads. The question is which workloads still justify paying the x86 premium.
The core thesis is simple and backed by numbers: ARM server chips deliver more throughput per watt and more throughput per dollar than their x86 counterparts on the workloads that dominate modern cloud spending — web serving, containerized microservices, in-memory caching, and machine learning inference. x86 retains real advantages in single-threaded legacy software, Windows Server workloads, and applications with hard dependencies on x86 ISA extensions. Everything else is a migration conversation.
AWS Graviton 4: The Benchmark That Changed the Conversation
AWS Graviton 4, announced in late 2023 and powering the R8g, C8g, and M8g instance families, is built on ARM Neoverse V2 cores and is reported to use a TSMC 4nm-class process. The chip ships with 96 cores, DDR5-5600 memory support, and a 75 MB system-level cache. AWS states that Graviton 4 delivers up to 30% better compute performance than Graviton 3, and up to 40% better performance per watt than comparable x86 instances in its own fleet.
On SPECrate2017_int_base, third-party tests of Graviton 4 instances report aggregate scores in the range of 650–700 across all cores, competitive with Intel Xeon Sapphire Rapids at similar price points while drawing less power at the instance boundary. For Java-based workloads, a major slice of enterprise cloud spend, Graviton 4 delivers roughly 20–25% higher throughput on SPECjbb2015 than Graviton 3, which itself already outperformed comparable Intel instances on that benchmark.
The pricing argument is direct. An AWS m8g.4xlarge (16 vCPU, Graviton 4) costs approximately $0.616/hour on-demand in us-east-1. A comparable m7i.4xlarge (16 vCPU, Intel Sapphire Rapids) runs at approximately $0.806/hour. That is a 24% cost reduction before you factor in that the ARM instance often handles higher request throughput per vCPU on stateless workloads.
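The arithmetic behind that figure can be sketched in a few lines of Python. The two hourly prices are the approximate ones quoted above; the function names and the 1.10 throughput ratio are hypothetical, chosen only to show how a per-vCPU throughput advantage compounds with the price gap.

```python
def hourly_saving(baseline_price: float, candidate_price: float) -> float:
    """Fractional price reduction of the candidate relative to the baseline."""
    return (baseline_price - candidate_price) / baseline_price


def saving_per_unit_work(baseline_price: float, candidate_price: float,
                         throughput_ratio: float) -> float:
    """Fractional cost reduction per request served.

    throughput_ratio is candidate requests/s divided by baseline requests/s
    at the same vCPU count (1.0 means equal throughput).
    """
    return 1 - (candidate_price / baseline_price) / throughput_ratio


# Approximate on-demand prices quoted in the text (us-east-1, USD per hour).
M7I_4XLARGE = 0.806
M8G_4XLARGE = 0.616

print(f"price saving: {hourly_saving(M7I_4XLARGE, M8G_4XLARGE):.1%}")

# Hypothetical: the ARM instance serves 10% more requests per second.
ASSUMED_THROUGHPUT_RATIO = 1.10
per_request = saving_per_unit_work(M7I_4XLARGE, M8G_4XLARGE, ASSUMED_THROUGHPUT_RATIO)
print(f"saving per request at 1.10x throughput: {per_request:.1%}")
```

With equal throughput the per-request saving equals the price saving; any measured throughput gain widens it, so the ratio is worth measuring on your own workload rather than assuming.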
Ampere Altra Max: 128 Cores, Single-Threaded Predictability
Ampere Computing's Altra Max makes a deliberately different trade-off from Graviton 4. Where AWS uses a high-performance core derived from Neoverse V2, Ampere builds Altra Max from a larger number of smaller ARM Neoverse N1 cores, each running a single thread with no simultaneous multithreading (SMT). The Altra Max ships with up to 128 cores, each running at up to 3.0 GHz, with 1 MB of private L2 cache per core (128 MB in aggregate), a shared system-level cache, and 8-channel DDR4-3200 memory. TDP sits at 250–270W for the 128-core variant.
The absence of SMT is a design choice with real consequences. Cloud providers using Altra Max can advertise vCPUs that map 1:1 to physical cores, which removes the sibling-thread contention behind much of the noisy-neighbor variance on SMT-enabled x86 instances under mixed load. Oracle Cloud Infrastructure offers Ampere A1 instances (earlier-generation Altra) at $0.01/OCPU-hour, among the cheapest compute options from any major cloud provider. Benchmark results from Phoronix on Altra Max nodes show linear scaling to 128 threads on embarrassingly parallel workloads, something SMT-enabled x86 chips stop delivering cleanly once the thread count exceeds their physical core count.
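One way to check the vCPU-to-core mapping from inside a Linux instance is to read the CPU topology that the kernel exposes in sysfs. The sketch below is a minimal illustration, with hypothetical helper names; it assumes the hypervisor passes real topology through to the guest, which is not guaranteed on every instance type.

```python
from pathlib import Path
from typing import List, Optional

# Standard Linux sysfs location listing the hardware threads that share cpu0's core.
SIBLINGS_PATH = Path("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list")


def parse_cpu_list(text: str) -> List[int]:
    """Expand a sysfs CPU list such as '0-1,64' into [0, 1, 64]."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def threads_per_core(siblings_text: str) -> int:
    """Number of hardware threads sharing one physical core."""
    return len(parse_cpu_list(siblings_text))


def read_siblings(path: Path = SIBLINGS_PATH) -> Optional[str]:
    """Return the raw siblings list, or None where sysfs is unavailable."""
    try:
        return path.read_text()
    except OSError:
        return None


raw = read_siblings()
if raw is None:
    print("CPU topology not available on this system")
elif threads_per_core(raw) == 1:
    print("cpu0 has no sibling threads: each vCPU appears to be a whole core")
else:
    print(f"cpu0 shares its core with siblings: {parse_cpu_list(raw)}")
```

A result of one thread per core is what the 1:1 mapping predicts; two usually means the vCPU is an SMT thread that shares execution resources with a neighbor.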
Ampere's target workload list reads like a catalog of modern infrastructure: NGINX, HAProxy, Redis, Memcached, PostgreSQL with read-heavy workloads, and containerized microservices on Kubernetes. For teams running these stacks, Altra Max instances measurably reduce per-request cost.
NVIDIA Grace: ARM Meets HBM3 for AI Workloads
NVIDIA's Grace CPU, used in the Grace Hopper and Grace Blackwell Superchip configurations, is a 72-core ARM Neoverse V2 design connected via NVLink-C2C to NVIDIA GPU dies. Grace delivers up to 500 GB/s of memory bandwidth from LPDDR5X, well above what an eight-channel DDR5 x86 server socket delivers (about 307 GB/s theoretical peak at DDR5-4800).
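Theoretical peak DRAM bandwidth follows directly from channel count and transfer rate, which makes figures like these easy to sanity-check. The sketch below assumes standard 64-bit (8-byte) channels; the Altra Max configuration comes from the text, the eight-channel DDR5-4800 socket is an assumed x86 comparison point, and the function name is hypothetical.

```python
def peak_bandwidth_gbs(channels: int, megatransfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for a DDR memory subsystem.

    Assumes each channel moves bytes_per_transfer bytes per transfer
    (8 bytes for a conventional 64-bit channel).
    """
    return channels * megatransfers_per_s * bytes_per_transfer / 1000


# Altra Max as described in the text: 8 channels of DDR4-3200.
altra_max = peak_bandwidth_gbs(channels=8, megatransfers_per_s=3200)

# Assumed comparison point: an x86 socket with 8 channels of DDR5-4800.
x86_ddr5 = peak_bandwidth_gbs(channels=8, megatransfers_per_s=4800)

GRACE_LPDDR5X_GBS = 500.0  # figure quoted in the text, not computed here

print(f"Altra Max, 8x DDR4-3200: {altra_max:.1f} GB/s")
print(f"x86, 8x DDR5-4800:       {x86_ddr5:.1f} GB/s")
print(f"Grace vs that x86 socket: {GRACE_LPDDR5X_GBS / x86_ddr5:.2f}x")
```

These are ceilings rather than measurements; sustained bandwidth on real workloads is lower on every platform.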
In the GH200 Grace Hopper Superchip, the CPU and H100 GPU share a coherent memory fabric over NVLink-C2C at 900 GB/s, roughly seven times the bandwidth of a PCIe Gen5 x16 link. The practical effect is to relieve the PCIe bottleneck that limits GPU utilization in LLM inference workloads where data must move frequently between CPU and GPU memory. For inference on large language and multimodal models that are bound by those transfers, the GH200 can deliver higher tokens-per-second per dollar than equivalent H100 SXM5 configurations with x86 host CPUs, primarily by cutting the time spent on host-to-device data movement.
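A rough way to see why the link matters is to estimate how long a given amount of data takes to cross it. The sketch below treats the 900 GB/s NVLink-C2C figure as a bidirectional total (450 GB/s each way) and assumes about 64 GB/s each way for PCIe Gen5 x16; both are ideal peaks, and the 40 GB payload is a hypothetical example rather than a measured workload.

```python
def transfer_ms(gigabytes: float, link_gb_per_s: float) -> float:
    """Ideal time in milliseconds to move a payload over a link at peak rate."""
    return gigabytes / link_gb_per_s * 1000


# Assumed one-direction peak rates in GB/s (see the caveats above).
NVLINK_C2C_ONE_WAY = 450.0
PCIE_GEN5_X16_ONE_WAY = 64.0

# Hypothetical payload: 40 GB of weights or KV cache staged from CPU memory.
payload_gb = 40.0

pcie_ms = transfer_ms(payload_gb, PCIE_GEN5_X16_ONE_WAY)
c2c_ms = transfer_ms(payload_gb, NVLINK_C2C_ONE_WAY)

print(f"PCIe Gen5 x16: {pcie_ms:.0f} ms")
print(f"NVLink-C2C:    {c2c_ms:.0f} ms")
print(f"ratio:         {pcie_ms / c2c_ms:.1f}x")
```

Real transfers add latency and protocol overhead, so the useful output is the ratio between the two links, not the absolute times.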
Apple M4 Ultra in Mac Pro: ARM at the Professional Workstation Tier
Apple's M4 Ultra, announced for the 2025 Mac Pro, combines two M4 Max dies via the UltraFusion interconnect. With each M4 Max topping out at 16 CPU cores (12 performance, 4 efficiency) and 40 GPU cores, that pairing gives up to 32 CPU cores (24 performance, 8 efficiency) and up to 80 GPU cores, with a unified memory architecture supporting up to 192 GB at over 800 GB/s aggregate bandwidth. Total system power sits around 300W, comparable to the TDP of a single high-end Intel Xeon W processor.
The Mac Pro is not a cloud server, but its benchmarks inform the server debate directly. In the Cinebench 2024 (R24) multi-threaded test, M4 Ultra scores approximately 9,000–9,500 points, comparable to a Threadripper 7970X, which draws roughly double the power. Developers building and testing ARM-native containerized applications on M4 Ultra Mac Pros can already run production-equivalent workloads locally before deploying to Graviton 4 or Altra Max. The software ecosystem gap is closing fast.
ARM's Architectural Advantages for Server Work
The reasons ARM wins on efficiency are structural, not temporary. AArch64 instructions are a fixed 32 bits wide, so a wide front end can decode many of them in parallel without the length-finding logic that x86's variable-length encoding requires. Without that decode complexity or legacy features such as x87, more of each die's area and power budget goes toward execution units and cache. Modern ARM server cores such as Neoverse V2 and Neoverse N2 implement out-of-order execution with wide pipelines; on integer and memory-intensive workloads, the V2 in particular is competitive with Intel's Golden Cove and AMD's Zen 4 on per-clock throughput.
Power efficiency numbers are consistent across independent testing. SPECpower_ssj2008 results, which measure performance per watt across load levels, show ARM server platforms from AWS, Ampere, and NVIDIA running 15–40% more efficiently than x86 equivalents, depending on workload and load level. At data center scale, that difference is measured in megawatts and millions of dollars annually.
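To see how a performance-per-watt gain turns into megawatts and dollars, the sketch below holds the work done constant and computes the power no longer needed. The 15% and 40% gains are the range quoted above; the 10 MW baseline, the $0.08/kWh electricity price, and the function names are hypothetical inputs to replace with your own.

```python
HOURS_PER_YEAR = 8760


def power_for_same_work(baseline_kw: float, efficiency_gain: float) -> float:
    """Power needed to do the baseline's work at (1 + gain) times the perf/watt."""
    return baseline_kw / (1 + efficiency_gain)


def annual_energy_savings_usd(baseline_kw: float, efficiency_gain: float,
                              usd_per_kwh: float, pue: float = 1.0) -> float:
    """Yearly electricity savings; pue scales IT power up to facility power."""
    saved_kw = baseline_kw - power_for_same_work(baseline_kw, efficiency_gain)
    return saved_kw * pue * HOURS_PER_YEAR * usd_per_kwh


# Hypothetical fleet: 10 MW of x86 IT load at $0.08 per kWh.
BASELINE_KW = 10_000.0
USD_PER_KWH = 0.08

for gain in (0.15, 0.40):  # efficiency range quoted in the text
    saved_mw = (BASELINE_KW - power_for_same_work(BASELINE_KW, gain)) / 1000
    usd = annual_energy_savings_usd(BASELINE_KW, gain, USD_PER_KWH)
    print(f"{gain:.0%} more efficient: {saved_mw:.2f} MW saved, ${usd:,.0f} per year")
```

Note that a 40% efficiency gain does not mean 40% less power: the same work needs 1/1.4 of the baseline power, a reduction of about 29%.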
Where x86 Still Wins
Honesty requires acknowledging where x86 retains the advantage:
- Windows Server workloads: AWS does not offer Windows on Graviton instances, and Azure's Cobalt 100 ARM instances ran Linux only as of 2024. SQL Server and .NET Framework (as opposed to .NET Core) remain x86-dependent in practice.
- Single-threaded legacy applications: AMD EPYC Genoa and Intel Sapphire Rapids both reach higher single-core boost clocks (up to roughly 4.5 GHz on frequency-optimized parts) than current ARM server chips, which matters for serialized workloads.
- AVX-512 dependent workloads: HPC codes and some video transcoding pipelines are hand-tuned for the AVX-512 SIMD extensions found on recent Intel and AMD server parts. ARM's SVE2 is competitive, but moving to it requires recompilation and re-tuning.
- ISV software with x86-only support or licensing: SAP HANA and several commercial EDA tools do not support ARM, and other products carry license terms that erase the cost benefit. Oracle Database is a partial exception, since Oracle has shipped 19c for Linux on ARM, so check the support matrix and license terms for your exact version.
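Before porting a workload that leans on ISA extensions, it helps to confirm what the target host actually exposes. The sketch below is one possible check on Linux, with hypothetical function names: it parses `/proc/cpuinfo`, where x86 kernels list extensions on a `flags` line (for example `avx512f`) and arm64 kernels list them on a `Features` line (for example `sve2`).

```python
import platform
from typing import Set


def cpu_features(cpuinfo_text: str) -> Set[str]:
    """Return the feature tokens from the first 'flags' or 'Features' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in ("flags", "features"):
            return set(value.split())
    return set()


def simd_summary(machine: str, features: Set[str]) -> str:
    """Describe the wide-SIMD support relevant to a port, given the arch name."""
    if machine in ("x86_64", "AMD64"):
        return "AVX-512 available" if "avx512f" in features else "no AVX-512"
    if machine in ("aarch64", "arm64"):
        return "SVE2 available" if "sve2" in features else "no SVE2 (NEON only)"
    return f"unrecognized architecture: {machine}"


try:
    with open("/proc/cpuinfo") as handle:
        detected = cpu_features(handle.read())
except OSError:
    detected = set()  # not Linux, or procfs is unavailable

print(platform.machine(), "->", simd_summary(platform.machine(), detected))
```

This only reports what the CPU offers. Whether a given library has an SVE2 or NEON code path is a separate question that needs a build and a benchmark.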
Actionable Takeaways for Engineers Choosing Cloud Instances
- Start your ARM migration with stateless HTTP workloads. NGINX, Node.js, Go, and containerized Python APIs build and run cleanly on ARM64 and show the fastest payback. Use AWS C8g or OCI Ampere A1 instances and run an A/B load test against your current x86 baseline before committing.
- For Java services, move to Graviton 4 early. The JVM has supported ARM64 for years. AWS's own benchmarks show 20–30% throughput gains on Spring Boot and Quarkus workloads on Graviton 4 versus comparable Intel instances, at lower cost.
- For AI inference at scale, evaluate GH200 before defaulting to H100 + x86. The unified memory architecture eliminates a real bottleneck for models above 70B parameters. Request access through AWS, CoreWeave, or NVIDIA DGX Cloud to benchmark your specific model.
- Do not migrate Windows Server or AVX-512 HPC workloads yet unless you have confirmed ARM-native builds and tested them. The cost savings do not materialize if the workload underperforms or requires ISA-specific libraries that have not been ported.
- Use Ampere Altra Max instances for Redis, Memcached, and NGINX. The 1:1 vCPU-to-core mapping and linear thread scaling make latency predictability measurably better than SMT-enabled x86 instances under variable load.
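The A/B load test recommended above needs a decision rule. One possible rule, sketched below, accepts the ARM candidate only if it is cheaper per million requests and its p99 latency stays within a tolerance of the x86 baseline. The class, function, and 5% tolerance are hypothetical; the hourly prices are the approximate ones quoted earlier, and the throughput and latency samples are synthetic placeholders for real load-test output.

```python
import statistics
from dataclasses import dataclass
from typing import List


@dataclass
class LoadTestResult:
    """Summary of one load-test run, filled in from your own measurements."""
    name: str
    hourly_price_usd: float
    requests_per_second: float
    latencies_ms: List[float]

    def usd_per_million_requests(self) -> float:
        return self.hourly_price_usd / (self.requests_per_second * 3600) * 1_000_000

    def p99_ms(self) -> float:
        # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
        return statistics.quantiles(self.latencies_ms, n=100)[98]


def worth_migrating(baseline: LoadTestResult, candidate: LoadTestResult,
                    max_p99_regression: float = 0.05) -> bool:
    """True if the candidate is cheaper per request and p99 stays within tolerance."""
    cheaper = candidate.usd_per_million_requests() < baseline.usd_per_million_requests()
    latency_ok = candidate.p99_ms() <= baseline.p99_ms() * (1 + max_p99_regression)
    return cheaper and latency_ok


# Synthetic throughput and latency samples, for illustration only.
x86 = LoadTestResult("x86 baseline", 0.806, 10_000.0,
                     [10.0 + 0.1 * i for i in range(200)])
arm = LoadTestResult("arm64 candidate", 0.616, 10_500.0,
                     [10.2 + 0.1 * i for i in range(200)])

for run in (x86, arm):
    print(f"{run.name}: ${run.usd_per_million_requests():.4f} per 1M requests, "
          f"p99 {run.p99_ms():.1f} ms")
print("migrate:", worth_migrating(x86, arm))
```

Gating on tail latency as well as cost keeps a cheaper instance from being accepted when it quietly degrades the slowest requests.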
ARM's server moment is not coming — it arrived. The remaining work is systematic migration of workloads that still run on x86 out of inertia rather than necessity.