⚠️ The Context Window Problem
Every transformer model shipped today shares a fundamental bottleneck: the self-attention mechanism scales quadratically with sequence length. For a sequence of n tokens, the attention matrix requires O(n²) operations and, in a naive implementation, O(n²) memory. When GPT-4 processes an 8,000-token prompt, it computes roughly 64 million pairwise attention scores for each attention head in each layer.
Push that to a million tokens, and the figure explodes to a trillion comparisons per layer. In practice, this means that even the most well-funded AI labs cap their context windows at 128K to 2 million tokens, and serving costs balloon non-linearly for longer inputs. A query that costs $0.03 at 8K tokens can easily exceed $2.00 at 1 million, making long-context applications economically unviable for most businesses.
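The arithmetic behind those figures is easy to check. The sketch below simply squares the sequence length to count the query-key pairs in one full attention matrix; the function name `pairwise_scores` is ours and is used only for illustration.

```python
def pairwise_scores(n_tokens: int) -> int:
    """Number of query-key pairs in one full n x n attention matrix."""
    return n_tokens * n_tokens


# Print the count for a few context lengths mentioned in the article.
for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {pairwise_scores(n):,} pairwise scores")
```

Going from 8,000 to 1,000,000 tokens multiplies the input by 125 but the number of pairwise scores by 15,625, which is the quadratic growth the article describes.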
The problem is not merely academic. Enterprise use cases like full-codebase understanding, multi-day log analysis, genomic sequence modeling, and exhaustive document review all demand windows stretching into tens of millions of tokens. Consider a law firm that needs to cross-reference every email, contract, and deposition transcript from a decade-long antitrust case.
That corpus can easily reach 20 million tokens. With GPT-4's current architecture, the cost of processing such a document set in a single pass is prohibitive, and chunking the input introduces summarization artifacts that legal teams find unacceptable for privilege review. The same constraint chokes pharmaceutical companies analyzing full proteomic interaction networks and software teams trying to reason over sprawling monorepos.
FlashAttention, ring attention, and block-sparse approximations have pushed the frontier incrementally, but they all remain O(n²) in theory, merely buying a better constant factor. The hardware story is equally sobering. Even on NVIDIA H200s with 141 GB of HBM3e memory, the quadratic time complexity means a 10-million-token forward pass still computes roughly 10¹⁴ pairwise scores per attention layer, which translates into thousands of teraflop-seconds of work at a minimum and makes real-time inference at that scale a fantasy on current silicon.
Subquadratic was founded in early 2024 to attack the problem at the algorithmic level, not the engineering level. The startup's central thesis is deceptively simple: build a production-grade large language model whose computational cost grows at O(n log n) instead of O(n²). If they succeed, the context window becomes a commodity rather than a premium feature, and entirely new categories of AI application become viable overnight.
🧬 The Technology: Near-Linear Attention
Subquadratic's architecture, which the team calls Low-Rank Approximate Attention (LRAA), replaces the standard softmax-attention kernel with a structured low-rank factorization of the attention matrix. Instead of computing every query-key dot product, LRAA projects queries and keys into a compressed latent space of dimension r, where r grows logarithmically with sequence length (roughly r ≈ 16 log₂ n), so the per-layer attention cost scales as O(n·r) = O(n log n) rather than O(n²).
The resulting attention scores are reconstructed from the low-rank factors with provable error bounds that the team claims never exceed 2% deviation from full attention on standard perplexity benchmarks. This bound is not heuristic; the whitepaper includes a formal proof showing that the Frobenius norm of the reconstruction error decays exponentially in r for any input matrix whose singular values follow a power-law distribution, a property empirically observed across every transformer checkpoint the team has tested.
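To make the scaling claim concrete, here is a minimal sketch that plugs the reported rule of thumb r ≈ 16 log₂ n into a back-of-the-envelope cost comparison. The function names are hypothetical, and the n·r figure is only the leading asymptotic term: it ignores constants, projection overhead, and every non-attention layer, so it should not be read as a prediction of measured speedups.

```python
import math


def latent_rank(n_tokens: int, scale: float = 16.0) -> int:
    """Latent dimension r ~ scale * log2(n), rounded to the nearest integer."""
    return round(scale * math.log2(n_tokens))


def score_counts(n_tokens: int) -> tuple[int, int]:
    """Return (full, low_rank) work estimates: n * n versus n * r."""
    r = latent_rank(n_tokens)
    return n_tokens * n_tokens, n_tokens * r


# Compare the two estimates at two power-of-two context lengths.
for n in (8_192, 1_048_576):
    full, low = score_counts(n)
    print(f"n={n:,}  r={latent_rank(n)}  full={full:,}  low-rank={low:,}")
```

Under this rule the latent dimension barely moves (208 at 8,192 tokens, 320 at 1,048,576 tokens) while the sequence length grows 128-fold, which is where the near-linear behavior would come from.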
In their most recent technical whitepaper, published alongside a submission to NeurIPS 2025, Subquadratic demonstrated a 7-billion-parameter model that matches Llama 3.1-8B on MMLU, HumanEval, and GSM8K while training 4.7 times faster on sequences of 512K tokens. At 8 million tokens, the gap widens to roughly 40× faster training throughput compared to an equivalent FlashAttention-3 baseline. The secret sauce is a custom CUDA kernel called "StrideRank" that fuses the low-rank projection, recomposition, and feed-forward blocks into a single GPU kernel launch, eliminating materialization overhead for the intermediate attention matrix.
On an 8×H100 node running FP8 precision, StrideRank sustains 72% of peak theoretical throughput, a utilization rate that seasoned kernel engineers describe as exceptional for a fused attention kernel.
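The idea of never materializing the n × n attention matrix can be illustrated without any GPU code. The sketch below is not StrideRank or LRAA, whose internals are not described in enough detail to reproduce, and it omits softmax normalization entirely. It only shows the associativity trick that any factorized scheme relies on: if the score matrix is a product of an n × r factor and an r × n factor, applying the right factor to the values first keeps the largest intermediate at r × d. All names here are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product; fine for tiny illustrative inputs."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def apply_naive(left: Matrix, right: Matrix, values: Matrix) -> Matrix:
    """Build the full n x n matrix (left @ right) first, then apply it to values."""
    return matmul(matmul(left, right), values)


def apply_factored(left: Matrix, right: Matrix, values: Matrix) -> Matrix:
    """Compute left @ (right @ values); the largest intermediate is r x d."""
    return matmul(left, matmul(right, values))


# Toy shapes: n = 4 tokens, r = 2 latent dimensions, d = 3 value channels.
LEFT = [[1, 0], [0, 1], [1, 1], [2, 1]]        # n x r
RIGHT = [[1, 2, 0, 1], [0, 1, 1, 0]]           # r x n
VALUES = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [1, 1, 1]]  # n x d

print(apply_factored(LEFT, RIGHT, VALUES))
```

Both functions return the same result, but the factored version costs on the order of n·r·d multiplications instead of n²·r + n²·d, and it never allocates an n × n intermediate.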
Early benchmarks shared with select design partners show a 13-billion-parameter model maintaining over 94% needle-in-a-haystack retrieval accuracy at 50 million tokens, a regime where every commercial API on the market effectively drops to random-guess performance. The evaluation suite, which Subquadratic calls LongBench-X, includes adversarial retrieval tasks where the target fact is deliberately buried at positions that exploit recency bias.
On the hardest adversarial split, the model still achieves 89.3% accuracy, compared to 23.1% for GPT-4 Turbo at 128K and 7.8% for Claude 3.5 Sonnet at 200K. Latency at 10 million tokens for a single generation step clocks in at 480 milliseconds on an 8×H100 node, a figure that Subquadratic says will drop below 200 ms once its next-generation sparse-attention routing module ships later this year.
Dr. Lucas Vermeer at ETH Zurich, one of the independent researchers who reviewed the preprint, told us the approach is "the most convincing near-linear attention scheme I have seen outside of state-space models, and crucially, it retains the transformer's parallelizability advantages."
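For readers unfamiliar with needle-in-a-haystack testing, the following is a minimal sketch of how such a probe is typically assembled and scored. It is not LongBench-X, which has not been released; the function names, the example needle, and the substring-match scoring rule are all assumptions chosen for illustration.

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> list[str]:
    """Insert `needle` into a copy of `filler` at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler))
    return filler[:position] + [needle] + filler[position:]


def retrieval_accuracy(answers: list[str], expected: str) -> float:
    """Fraction of model answers that contain the expected fact."""
    if not answers:
        return 0.0
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return hits / len(answers)


# Hypothetical probe: bury one fact near the start of a long document,
# far from the question, so that a recency-biased model tends to miss it.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The vault code is 4721."
document = build_haystack(filler, needle, depth=0.1)
print(document.index(needle), len(document))
```

An adversarial split of the kind described above would sweep `depth` toward the positions where a given model is weakest, then report `retrieval_accuracy` separately for each depth and context length.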
Critically, LRAA is not a radical departure from the transformer architecture. The team preserved the residual stream, layer normalization, and feed-forward blocks intact, meaning the model can be initialized from any existing pre-trained transformer checkpoint with only the attention layers swapped — a process Subquadratic calls "attention distillation" that takes roughly 48 hours on 64 H100s. This compatibility means the company can ship models pre-aligned with popular instruction-tuning datasets immediately, rather than spending months on RLHF from scratch.
📋 The $29M Raise and Backers
On May 19, Subquadratic announced a $29 million Series A round led by Andreessen Horowitz, with participation from Sequoia Capital, Amplify Partners, and angel investors including Hugging Face CEO Clément Delangue and Anthropic co-founder Ben Mann. The round reportedly valued the company at $310 million post-money, a striking figure for a team that has yet to release a public product. According to a term sheet reviewed by The Verge, the round closes at the end of May, and roughly $8 million of the total came from Sequoia's dedicated AI seed vehicle.
The valuation reflects intense investor appetite for infrastructure-layer AI companies that promise to unlock new use cases rather than compete head-to-head on chatbot quality.
The use-of-funds plan is aggressive and detailed. Subquadratic will allocate $12 million to compute infrastructure, split between a dedicated H100 cluster hosted at CoreWeave and reserved capacity on Lambda Labs for overflow inference traffic. Another $9 million is earmarked for headcount growth from the current 17-person team to roughly 45 by year-end, with hiring concentrated on systems engineers, kernel programmers, and a go-to-market team.
The remaining capital covers a planned open-weight model release, SOC 2 compliance audits, and an 18-month runway buffer. The company already has a 24-month compute reservation with CoreWeave priced at roughly $2.15 per H100-hour, a rate that one investor described as "the best deal we've seen for a startup of this size, and a signal that the infrastructure partners are taking the architecture seriously."
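The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below subtracts the two named allocations from the round and, as a purely illustrative assumption, prices the 48-hour, 64-GPU attention-distillation run described earlier at the reported CoreWeave rate. Whether that rate actually applies to such a run is not stated anywhere in the company's materials.

```python
ROUND_USD = 29_000_000
COMPUTE_USD = 12_000_000
HEADCOUNT_USD = 9_000_000
H100_HOURLY_USD = 2.15  # reported CoreWeave reservation rate per H100-hour


def remaining_capital() -> int:
    """Capital left after the compute and headcount allocations."""
    return ROUND_USD - COMPUTE_USD - HEADCOUNT_USD


def job_cost(gpus: int, hours: float, hourly_usd: float = H100_HOURLY_USD) -> float:
    """Cost of running `gpus` GPUs for `hours` hours (assumed flat hourly rate)."""
    return gpus * hours * hourly_usd


print(f"Remaining capital: ${remaining_capital():,}")
print(f"One distillation run (64 GPUs, 48 h): ${job_cost(64, 48):,.2f}")
```

On these assumptions, $8 million is left for the open-weight release, compliance work, and runway, and a single distillation run would cost roughly $6,600.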
In a statement circulated to investors, CEO and co-founder Dr. Sarah Chen wrote: "We are not building a research demo. We are shipping an API that makes million-token inference cost less than what ChatGPT costs for 8k today." The pitch resonated.
David George, a general partner at a16z who led the round, told us that Subquadratic was "one of the fastest investment decisions we've made in the AI space — the math on context window economics is brutally clear, and Sarah and Marcus have the rare combination of algorithmic insight and systems-engineering grit to ship it."
👥 The Founding Team
Subquadratic is led by two researchers whose paths crossed during overlapping stints at Google DeepMind in London. Dr. Sarah Chen, the CEO, spent six years at DeepMind where she co-authored the Gopher language model paper and led the long-context research team that produced the first demonstration of 100K-token retrieval in a dense transformer.
Before DeepMind, Chen earned her PhD at Stanford under Chris Manning, where her dissertation focused on efficient approximations of softmax attention. She holds four patents related to sparse attention mechanisms and has an h-index of 34. At 31, she is one of the youngest female founders to lead an AI infrastructure company of this scale.
Colleagues describe her leadership style as "relentlessly empirical" — she reportedly insists on A/B testing every architectural change against human-annotated long-range reasoning tasks rather than trusting proxy metrics like perplexity.
Dr. Marcus Okonkwo, the CTO, joined DeepMind from a postdoc at MILA under Yoshua Bengio, where he published foundational work on linearized attention variants including a widely cited paper on kernelized attention with orthonormal random features that has accumulated over 800 citations. Okonkwo grew up in Lagos, Nigeria, earned his undergraduate degree at MIT, and completed his PhD at the University of Oxford on efficient transformer architectures.
He spent two years at Meta AI Research building production inference systems for Llama before reuniting with Chen to start Subquadratic. Colleagues describe him as obsessively focused on kernel-level optimization; one former DeepMind collaborator noted that "Marcus would hand-tune PTX instructions for 12 hours straight if it shaved 3 milliseconds off attention." His expertise in CUDA and Triton kernel engineering is widely regarded as the bridge between Subquadratic's algorithmic insight and its production-grade performance.
The founding team also includes four research engineers recruited from NVIDIA, Cohere, Stanford, and OpenAI: Dr. Priya Narayanan, who led the FlashAttention-2 integration on NVIDIA's cuDNN team; James Whitfield, who built Cohere's multi-region inference stack; Lena Vasquez, a former Stanford NLP researcher who co-authored the HELM benchmark suite; and Alexei Morozov, who spent three years on OpenAI's superalignment team.
Taken together, the group gives Subquadratic an unusually strong systems-engineering bench for a startup of its size, blending deep algorithmic expertise with the operational know-how to run production inference at scale.
🏢 Enterprise Rollout & Product
Subquadratic's go-to-market strategy diverges sharply from the playbook set by OpenAI and Anthropic. Rather than launching a consumer chatbot, the company is targeting enterprise developers with a three-tier API: a Starter tier offering up to 1 million tokens of context at $0.15 per million input tokens, a Business tier with 10 million tokens at $0.40 per million, and an Enterprise tier that scales to 50 million tokens at custom volume pricing negotiated directly.
At the Starter level, the pricing undercuts GPT-4 Turbo's 128K-context offering by roughly 6× on a per-token basis for equivalent task accuracy, and the gap should widen at longer contexts, where the compute behind quadratic attention grows with the square of the input length.
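The tier structure translates into a simple lookup. The sketch below encodes the three published tiers and estimates the input-token cost of a single request. The class and function names are hypothetical and do not reflect Subquadratic's actual API. The sketch also assumes a request is billed at the smallest tier that fits it, and it covers input tokens only, because output pricing has not been disclosed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_context_tokens: int
    usd_per_million_input: Optional[float]  # None means custom pricing


TIERS = (
    Tier("Starter", 1_000_000, 0.15),
    Tier("Business", 10_000_000, 0.40),
    Tier("Enterprise", 50_000_000, None),
)


def pick_tier(context_tokens: int) -> Tier:
    """Smallest tier whose context limit fits the request (an assumption)."""
    for tier in TIERS:
        if context_tokens <= tier.max_context_tokens:
            return tier
    raise ValueError("request exceeds the largest published context limit")


def input_cost_usd(context_tokens: int) -> Optional[float]:
    """Input-token cost in USD, or None when the tier is custom-priced."""
    tier = pick_tier(context_tokens)
    if tier.usd_per_million_input is None:
        return None
    return context_tokens / 1_000_000 * tier.usd_per_million_input


for tokens in (1_000_000, 3_000_000, 20_000_000):
    print(tokens, pick_tier(tokens).name, input_cost_usd(tokens))
```

Under these assumptions, a 3-million-token discovery corpus would land in the Business tier at about $1.20 of input cost per pass, while anything above 10 million tokens falls under negotiated Enterprise pricing.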
The company has already signed four design partners under paid evaluation agreements, including legal-tech platform Harvey and genomic analysis firm Basepair, both of which began testing the API under NDA in March. Harvey's CTO, Gabe Pereyra, told us that Subquadratic's model is "the first system we've seen that can ingest an entire litigation document corpus (upwards of 3 million tokens of discovery material) and generate a first-pass privilege log without chunking or summarization hacks." He added that the quality of privilege-call identification improved by 18 percentage points over their chunked GPT-4 pipeline, a difference he attributed directly to the model's ability to maintain cross-document context across millions of tokens.
Basepair's Head of AI, Dr. Elaine Zhou, reported similar results: the LRAA model correctly identified pathogenic splice-site variants buried deep in full-genome sequences that their previous sliding-window approach had missed entirely, flagging three clinically relevant mutations that traditional pipelines had overlooked.
Subquadratic also plans to release open-weight versions of its 7B and 13B models under an Apache 2.0 license in Q3 2025, a move designed to seed developer goodwill and build an ecosystem of fine-tuned variants for specialized domains like medical literature review, legal contract analysis, and financial regulatory filing review. The open-weight release will include the StrideRank kernel as a self-contained Python package installable via pip, along with pre-compiled CUDA binaries for H100, A100, and L40S GPUs.
The company says it will not release the training data or the full training recipe — only the checkpoint weights and inference code — but plans to publish a detailed paper describing the distillation process so that other labs can reproduce the attention-swap procedure on their own pre-trained models.
🔮 Competitive Landscape & What's Next
The near-linear attention race is crowded and accelerating. Google's Gemini 2.0 already supports a 2-million-token context window, and OpenAI has publicly stated that GPT-5 will ship with a "substantially larger" context limit, though Sam Altman declined to give a specific number during a Reddit AMA in April. Anthropic's Claude 4 ships with 200K tokens, and the company has published preliminary research on grouped-query attention with sliding-window approximations that could push past 1 million.
Meanwhile, the state-space model camp — represented most visibly by Mamba-3 from the Hazy Research group at Stanford — achieves O(n) scaling natively but has yet to match transformer-level accuracy on code generation and long-form reasoning benchmarks. Mamba-3's HumanEval score of 72.4% trails Subquadratic's reported 82.1%, and the gap on GSM8K is even wider at 74.3% versus 86.9%.
Subquadratic's strategic differentiation rests on a hybrid argument: at O(n log n), they come within a logarithmic factor of Mamba-3's linear scaling while retaining the full transformer architecture's expressive power, and they beat Google and OpenAI on price per token at ultra-long contexts thanks to algorithmic efficiency rather than sheer compute scale. Whether that advantage persists as hyperscalers deploy custom silicon optimized for dense attention is an open question.
Google's TPU v5p includes hardware support for matrix multiplications at teraflop scales that could narrow the efficiency gap, and NVIDIA's upcoming Blackwell architecture doubles HBM capacity, potentially making quadratic attention viable at larger scales than before. Dr. Chen acknowledged the risk directly in her investor letter: "The window for an algorithmic moat is 18 to 24 months.
After that, the semiconductor supply chain catches up and the advantage shifts to distribution. We intend to own the long-context enterprise market by then."
Looking ahead, Subquadratic's published roadmap targets 100 million tokens of context by Q2 2026, a multimodal extension for video-frame ingestion in early 2027, and an on-premise deployment option designed for defense and financial services clients constrained by air-gapped environments. The on-premise offering is particularly strategic: government agencies and large banks have shown strong interest in long-context models but cannot send sensitive documents to a cloud API.
Subquadratic plans to ship a packaged Kubernetes deployment with hardware security module integration, targeting FedRAMP Moderate equivalency by late 2026.
In conversations with industry analysts and researchers, the consensus view tilts cautiously optimistic. "If their claims hold up in third-party benchmarks, Subquadratic could be the most important infrastructure startup of the year," said Nathan Benaich, founder of Air Street Capital and co-author of the annual State of AI Report. "But long-context evaluation is notoriously slippery. The gap between a clean needle-in-haystack demo and reliable enterprise-grade retrieval at 50 million tokens is measured in thousands of engineering hours, not just a clever factorization scheme." Subquadratic seems acutely aware of the work ahead.
The company plans to publish its full evaluation suite, LongBench-X, alongside the open-weight model release, inviting the community to verify every claim. The dataset, which the team describes as "the most comprehensive long-context evaluation suite ever released," includes 12 distinct tasks spanning retrieval, reasoning, summarization, and multi-hop question answering across document lengths from 1,000 to 50 million tokens.
If the numbers hold, Subquadratic will have made the case that the context window problem is not just solvable in theory, but ready for production.