🏃 AMD Breaks NVIDIA's AI Training Monopoly
AMD's MI350X accelerator captured 22% of the AI accelerator market in Q1 2026, up from 8% in Q1 2025, according to Mercury Research. The surge is driven by Meta's deployment of 350,000 MI350X GPUs for Llama 4 training, Microsoft Azure's new MI350X instances (NDv6 series), and Oracle Cloud Infrastructure's MI350X bare-metal offerings. The MI350X offers 2.3 petaFLOPS of BF16 compute with 288GB of HBM3e memory, meeting or exceeding NVIDIA H200 specifications (141GB of HBM3e) at a 30-40% lower per-unit cost.
AMD's breakthrough rested on years of software investment. ROCm 6.2, released in February 2026, finally achieved near feature parity with CUDA 12 for the most common AI workloads: FlashAttention-3, torch.compile, FSDP2 (Fully Sharded Data Parallel), and tensor parallelism all now work reliably on AMD hardware. Because PyTorch supports ROCm upstream, with official ROCm builds shipping since PyTorch 1.8 in 2021, most models and libraries now run on AMD GPUs without modification.
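One reason most PyTorch code runs unmodified: ROCm builds of PyTorch reuse the `torch.cuda` namespace, so the familiar `"cuda"` device string selects an AMD GPU. The sketch below assumes only that some PyTorch build is installed; `describe_backend` is a hypothetical helper, while `torch.version.hip` and `torch.version.cuda` are real attributes that report which GPU stack the build targets (each is `None` when the build does not target that stack).

```python
import torch


def describe_backend() -> str:
    """Hypothetical helper: report which accelerator stack this PyTorch build is using."""
    if not torch.cuda.is_available():
        return "cpu"
    # ROCm builds keep the torch.cuda API; torch.version.hip is set instead of torch.version.cuda.
    if torch.version.hip is not None:
        return f"rocm (HIP {torch.version.hip})"
    return f"cuda {torch.version.cuda}"


# The same device string "cuda" selects an AMD GPU on a ROCm build of PyTorch.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(256, 512, device=device)
b = torch.randn(512, 128, device=device)
# On GPU builds this matmul goes to the vendor BLAS (rocBLAS/hipBLASLt on ROCm, cuBLAS on CUDA).
c = a @ b

print(describe_backend(), tuple(c.shape))
```

No line of the model code names a vendor; the backend is decided by which PyTorch wheel is installed.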
Meta also open-sourced the GPU fleet-management and training-orchestration tools it built for its MI350X clusters, accelerating the maturation of the ROCm ecosystem.
📋 Intel and AWS Find Their Niches
Intel's Gaudi 3 accelerator holds 5% market share, concentrated in cost-sensitive training workloads. Its price-performance ($12,500 per card versus $30,000+ for an H200) and 24 integrated 200GbE ports make it economical for distributed training of models up to 70B parameters. Stability AI and several Chinese AI labs have built Gaudi 3 clusters, attracted by the lower total cost of ownership.
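A rough memory estimate shows why built-in networking matters even at 70B parameters. This sketch assumes mixed-precision training with Adam at about 16 bytes of state per parameter (fp16 weights and gradients plus fp32 master weights and two Adam moments, a common accounting) and 128GB of HBM per Gaudi 3 card; it ignores activations and buffers, so real jobs need more cards than it reports. The function name is hypothetical.

```python
def min_cards_for_training_state(params: int, bytes_per_param: int, hbm_bytes_per_card: int) -> int:
    """Lower bound on cards needed just to hold weights, gradients, and optimizer state."""
    total_bytes = params * bytes_per_param
    return -(-total_bytes // hbm_bytes_per_card)  # integer ceiling division


# Assumptions: 70B-parameter model, ~16 bytes/param for mixed-precision Adam,
# 128 GB (decimal) of HBM per Gaudi 3 card; activations and buffers are ignored.
cards = min_cards_for_training_state(70_000_000_000, 16, 128_000_000_000)
print(cards)  # 9
```

Because the training state alone spans at least nine cards, the model must be sharded (for example, FSDP-style) across a multi-card cluster, and every training step then depends on card-to-card bandwidth.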
AWS Trainium2, Amazon's second-generation custom AI chip, powered the training of Anthropic's Claude 4 at what AWS claims is 40% lower cost than equivalent NVIDIA H200 clusters. Trainium2 delivers 830 teraFLOPS of BF16 compute with 96GB of HBM per chip. Chips are interconnected via NeuronLink v2, a 1.6 Tbps chip-to-chip fabric that scales to 100,000+ chips in a single training cluster. AWS has committed $12 billion to developing Trainium2 and Trainium3 (slated for 2027), signaling serious intent to reduce its dependency on NVIDIA.
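The per-chip figures above multiply out to cluster-scale totals. This sketch simply scales the reported numbers to a 100,000-chip cluster. The results are theoretical peaks (no chip sustains its rated FLOPS in real training), decimal units are assumed, and the 1.6 Tbps fabric figure is only converted from bits to bytes, since the paragraph does not say whether it is per link or per chip.

```python
# Figures as reported in the paragraph above; not independently verified.
CHIPS = 100_000
BF16_TFLOPS_PER_CHIP = 830
HBM_GB_PER_CHIP = 96
NEURONLINK_TBPS = 1.6  # reported chip-to-chip fabric rate, terabits per second


def cluster_totals(chips: int) -> dict:
    """Theoretical peak aggregates for a cluster; sustained utilization is far lower."""
    return {
        "bf16_exaflops": chips * BF16_TFLOPS_PER_CHIP / 1_000_000,  # 1 EFLOPS = 1e6 TFLOPS
        "hbm_petabytes": chips * HBM_GB_PER_CHIP / 1_000_000,       # decimal: 1 PB = 1e6 GB
        "link_gigabytes_per_s": NEURONLINK_TBPS * 1000 / 8,         # terabits/s -> gigabytes/s
    }


totals = cluster_totals(CHIPS)
print(totals)  # {'bf16_exaflops': 83.0, 'hbm_petabytes': 9.6, 'link_gigabytes_per_s': 200.0}
```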
📋 NVIDIA's Enduring Moat
Despite market share erosion, NVIDIA retains a formidable position. Its 68% unit share translates to 82% revenue share because of premium pricing on Blackwell B200 and H200 GPUs, which command $30,000-$40,000 per unit versus $15,000-$20,000 for the MI350X. The CUDA software ecosystem remains the decisive advantage: TensorRT-LLM on H200 achieves 20-30% higher inference throughput than vLLM on MI350X for identical models, and NVIDIA's cuDNN, cuBLAS, NCCL, and Megatron-LM libraries embody nearly two decades of CUDA optimization that competitors are still replicating.
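The gap between unit share and revenue share follows directly from average selling prices (ASPs). A minimal sketch, assuming the non-NVIDIA remainder of the market can be represented by one blended ASP (a simplification, since Gaudi 3 and Trainium2 are priced differently from the MI350X): with unit share u and revenue share r, the leader's ASP relative to the rest is r(1 - u) / (u(1 - r)). Both function names are hypothetical.

```python
def implied_asp_ratio(unit_share: float, revenue_share: float) -> float:
    """Leader ASP divided by the blended ASP of everyone else, implied by the two shares."""
    u, r = unit_share, revenue_share
    # From r = u*p_lead / (u*p_lead + (1-u)*p_rest), solve for p_lead / p_rest.
    return (r * (1 - u)) / (u * (1 - r))


def revenue_share(unit_share: float, asp_ratio: float) -> float:
    """Revenue share for a given unit share and leader-to-rest ASP ratio."""
    lead = unit_share * asp_ratio
    return lead / (lead + (1 - unit_share))


print(f"{implied_asp_ratio(0.68, 0.82):.2f}")  # 2.14
# Midpoints of the quoted price ranges: $35,000 vs $17,500 gives an ASP ratio of 2.0.
print(f"{revenue_share(0.68, 35_000 / 17_500):.1%}")  # 81.0%
```

An 82% revenue share thus implies NVIDIA's average unit sells for roughly 2.1 times the blended price of competitors' units, consistent with the quoted price ranges.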
NVIDIA CEO Jensen Huang has framed this as "the software moat," arguing that GPU hardware specifications alone understate NVIDIA's competitive position.