Part X: Frontiers
Chapter 34: Emerging Architectures & Scaling Frontiers

Scaling Frontiers: What Comes Next

"We keep scaling until the electricity bill becomes the research contribution."

Frontier Frontier, Budget-Busting AI Agent
Big Picture

The era of "just scale up the transformer on more internet text" is approaching its limits. The next generation of frontier models will be shaped by constraints: finite high-quality data, the economics of compute, and the diminishing returns of naive scaling. This section maps the three axes along which scaling can proceed (data, compute, test-time inference) and surveys the architectural alternatives that may reshape the landscape. Understanding these frontiers is essential for anyone making multi-year bets on AI infrastructure, model selection, or research direction.

Prerequisites

This section builds on pretraining and scaling laws from Chapter 06, inference optimization from Chapter 09, and synthetic data generation from Chapter 13. Familiarity with the Chinchilla scaling laws (Section 06.3) is especially important.

1. The Data Wall

The Chinchilla scaling laws (Hoffmann et al., 2022), covered in Section 06.3, established that optimal training requires scaling data and compute proportionally. For a model with N parameters, you need roughly 20N tokens of training data to achieve compute-optimal performance. This creates a straightforward problem: where do those tokens come from?
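The 20-tokens-per-parameter rule of thumb is easy to turn into a back-of-the-envelope calculator. The sketch below is an illustration, not an exact law (the true Chinchilla coefficients depend on the training setup):

```python
def chinchilla_optimal_tokens(n_params: float, ratio: float = 20.0) -> float:
    """Approximate compute-optimal training tokens: ~20 tokens per parameter."""
    return ratio * n_params

# Data requirements at a few model sizes, in trillions of tokens.
for n_params in [7e9, 70e9, 405e9, 1e12]:
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:>6.0f}B params -> {tokens / 1e12:5.1f}T tokens")
```

At a trillion parameters, the requirement (20 trillion tokens) already exceeds most estimates of the high-quality public text stock.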

How Much Data Exists?

Villalobos et al. (2024) at Epoch AI conducted the most systematic analysis of this question. Their estimates suggest that the total stock of publicly available, high-quality text on the internet is between 4 and 15 trillion tokens, depending on quality thresholds. Lower-quality text (scraped web pages without filtering) extends this to perhaps 30 to 50 trillion tokens, but with rapidly diminishing quality.

To put this in perspective: GPT-4 is estimated to have been trained on approximately 13 trillion tokens. Llama 3 (405B) was trained on 15 trillion tokens. If the Chinchilla-optimal data requirement for a 10-trillion-parameter model is 200 trillion tokens, we have a problem. The internet does not contain enough high-quality text to support the next order-of-magnitude increase in model size.

The data wall is not just about volume. It is also about diversity. As models absorb more of the available text, each additional token provides less novel information. The marginal value of the 15-trillionth token is far less than the marginal value of the 1-trillionth token, because it is increasingly likely to be a paraphrase of content already seen.

Responses to the Data Wall

Frontier labs are pursuing several strategies simultaneously: generating synthetic data (the subject of the next section), investing more heavily in data quality and curation (Axis 2 in the three-axes framework below), repeating high-quality data across multiple epochs, and expanding into multimodal data sources (Section 6).

Fun Fact

Humanity spent thousands of years producing the written record. Frontier labs consumed most of it in a single training run. The internet, it turns out, is a surprisingly finite resource when your appetite is measured in trillions of tokens.

2. Synthetic Data: Promise and Peril

The most controversial response to the data wall is synthetic data: using model-generated text to train the next generation of models. The promise is obvious. If a frontier model can generate high-quality text, that text can be used to train a larger model, creating a virtuous cycle that decouples scaling from the finite stock of human-generated data.

Where Synthetic Data Works

Synthetic data has proven effective in specific, constrained domains, particularly where outputs can be verified automatically: mathematical problems and code (where a checker or test suite can filter out bad generations) and instruction-following data (as in the Evol-Instruct approach cited in the references).

Model Collapse: The Danger

Shumailov et al. (2024), published in Nature, demonstrated a sobering phenomenon they termed "model collapse." When models are trained on data generated by previous model generations, the distribution of the training data narrows progressively. Rare events, tail phenomena, and minority perspectives are gradually lost. After several generations of recursive training, the model's output distribution converges to a narrow mode that no longer reflects the diversity of the original human-generated data.

The analogy is photocopying a photocopy: each generation loses fidelity, and the losses compound. The authors showed this effect across multiple model families and training setups, suggesting it is a fundamental property of iterative synthetic data generation rather than an artifact of a particular approach.
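A toy simulation makes the photocopy effect concrete. This is not the authors' experimental setup, just a minimal caricature: each "generation" fits an empirical token distribution to its corpus and then produces the next corpus by sampling from that fit. Tail tokens that are never sampled vanish permanently, so the support of the distribution can only shrink:

```python
import random

def fit_model(corpus):
    """'Train' a model: fit an empirical distribution to the corpus."""
    counts = {}
    for tok in corpus:
        counts[tok] = counts.get(tok, 0) + 1
    total = len(corpus)
    return {tok: c / total for tok, c in counts.items()}

def generate(model, n, rng):
    """Produce a synthetic corpus by sampling from the fitted model."""
    tokens = list(model)
    weights = [model[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)

rng = random.Random(0)
corpus = [rng.randint(0, 999) for _ in range(5000)]  # gen 0: "human" data

support = []
for generation in range(10):
    model = fit_model(corpus)
    support.append(len(model))           # how many distinct tokens survive
    corpus = generate(model, 5000, rng)  # next gen trains on synthetic data

print(support)  # non-increasing: a token dropped from the tail never returns
```

Real model collapse is subtler (it involves approximation error in continuous distributions, not just sampling dropout), but the one-way loss of tail mass is the same mechanism.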

The practical implication is that synthetic data can supplement human data but probably cannot fully replace it. A training pipeline that relies primarily on synthetic data will need robust mechanisms to detect and counteract distribution narrowing, such as diversity metrics, tail-event preservation, and periodic infusion of human-generated data.
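The safeguards named above can be as simple as a corpus-level diversity statistic and a hard floor on the human-data fraction. The helpers below are illustrative sketches; the distinct-n metric and the 30% floor are arbitrary example choices, not prescriptions from the literature:

```python
import random

def distinct_n(tokens, n=2):
    """Fraction of n-grams that are unique: a cheap diversity signal that
    falls as a corpus's distribution narrows."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def mix_with_human_floor(human, synthetic, human_frac=0.3, seed=0):
    """Blend synthetic data with a guaranteed fraction of human-generated
    data, one of the countermeasures to distribution narrowing."""
    rng = random.Random(seed)
    n_human = int(len(synthetic) * human_frac / (1.0 - human_frac))
    mixed = rng.sample(human, min(n_human, len(human))) + list(synthetic)
    rng.shuffle(mixed)
    return mixed
```

In a real pipeline, distinct_n (or an embedding-based diversity measure) would be tracked across training generations, with an alarm when it trends downward.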

Mental Model: The Sourdough Starter

Synthetic data is like maintaining a sourdough starter. You can refresh it indefinitely by feeding it flour (compute), and the active cultures (knowledge) persist across generations. But if you only ever feed it the same flour, the microbial diversity narrows. Occasionally, you need to introduce wild yeast (human-generated data) from a new source to maintain a healthy, diverse culture. A purely self-referential starter eventually produces bland, uniform bread.

3. The Three Axes of Scaling

Historically, scaling meant one thing: bigger pretraining runs (more parameters, more data, more compute). But the frontier has diversified into three distinct scaling axes, each with different economics and different diminishing returns profiles.

Axis 1: Pretraining Compute

This is the traditional scaling axis. Train a larger model on more data. The cost grows roughly quadratically with model size (because both the number of parameters and the data requirement increase). Frontier training runs in 2025 are estimated to cost $100 million to $1 billion. The returns, as measured by benchmark improvements, are diminishing: each doubling of compute yields smaller absolute gains.
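The quadratic growth is easy to see with the widely used approximation of about 6 FLOPs per parameter per training token (an industry rule of thumb, not a figure from this section), combined with the Chinchilla allocation D = 20N:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

# Under D = 20N, each 10x in model size costs ~100x the compute.
for n_params in [10e9, 100e9, 1000e9]:
    n_tokens = 20 * n_params
    print(f"{n_params / 1e9:>5.0f}B params: "
          f"{train_flops(n_params, n_tokens):.1e} training FLOPs")
```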

Axis 2: Data Quality and Curation

The Llama strategy (Touvron et al., 2023) demonstrated an alternative: train a smaller model for longer on higher-quality data. Llama 1's smaller variants (7B and 13B) were trained on one trillion tokens, several times the Chinchilla-optimal data allocation for their size. This "over-training" approach produces models that are less compute-efficient during training but cheaper at inference, because the resulting model is smaller and less expensive to serve.

This insight has reshaped the industry. Most practitioners serve models, not train them. A model that costs twice as much to train but is half the size at serving time is often the better economic choice. The data quality axis offers a way to improve performance without increasing model size, by investing in better data curation, filtering, and curriculum design.

Axis 3: Test-Time Compute

The most significant recent development is the emergence of test-time compute scaling, as discussed in the context of reasoning models in Section 07.4. Models like OpenAI's o1 and o3, DeepSeek-R1, and Anthropic's extended thinking demonstrate that spending more compute at inference time (through chain-of-thought reasoning, search, verification, and self-correction) can dramatically improve performance on hard tasks.

The economics of test-time scaling are fundamentally different from pretraining scaling. Pretraining is a one-time cost amortized over all future queries. Test-time compute is a per-query cost that scales linearly with usage. This means test-time scaling is most cost-effective for hard, high-value queries where the additional compute is justified by the value of a correct answer.

The interplay between pretraining and test-time compute creates interesting trade-offs. A smaller, well-trained model with generous test-time compute can sometimes match a much larger model with minimal test-time compute. The optimal allocation depends on the query distribution: if most queries are easy, invest in pretraining; if a few queries are very hard and high-value, invest in test-time compute.
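To make the trade-off concrete, here is a hypothetical per-query comparison using the rough approximation of 2 FLOPs per parameter per generated token; the model sizes and token counts below are invented for illustration:

```python
def query_flops(n_params: float, gen_tokens: int) -> float:
    """Rough per-query decode cost: ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params * gen_tokens

# Hypothetical: a large model answering directly vs a model one-tenth the
# size spending 8x the tokens on chain-of-thought reasoning.
large_direct = query_flops(700e9, 500)      # 700B params, 500 output tokens
small_reasoning = query_flops(70e9, 4000)   # 70B params, 4000 reasoning tokens

print(f"large, direct:    {large_direct:.1e} FLOPs/query")
print(f"small, reasoning: {small_reasoning:.1e} FLOPs/query")
```

Even with eight times the output tokens, the smaller model spends less compute per query in this example, which is why the allocation question depends so heavily on the query distribution.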

The table below summarizes how these three scaling axes differ in their cost structure, return profiles, and ideal use cases.

Scaling Axis Comparison
Scaling Axis        | Cost Structure               | Diminishing Returns                  | Best For
--------------------|------------------------------|--------------------------------------|-----------------------------------
Pretraining compute | One-time, amortized          | Moderate (power law)                 | General capability uplift
Data quality        | One-time curation investment | Low (quality improvements compound)  | Efficiency, smaller serving models
Test-time compute   | Per-query, variable          | Task-dependent                       | Hard tasks, high-value queries

Notice that pretraining compute and data quality are one-time investments, while test-time compute is an ongoing per-query expense. This asymmetry is central to the economic calculus of frontier model development.

Key Insight

The three scaling axes are not independent; they interact. A model trained on better data (Axis 2) may need less test-time compute (Axis 3) because it has stronger priors. A model with more pretraining compute (Axis 1) may benefit less from test-time reasoning because it has already internalized the relevant knowledge. The optimal scaling strategy for a given application depends on the query mix, the cost constraints, and the latency requirements. There is no universally optimal allocation. The practical implication for teams making deployment decisions (see the cost analysis in Section 33.5) is to benchmark each axis independently for your specific use case rather than relying on general-purpose scaling curves.

4. Alternative Architectures

Beyond scaling existing transformers along these three axes, a separate question looms: is the transformer itself the right architecture? The transformer has dominated language modeling since 2017, but several alternative architectures are challenging its supremacy, particularly for long-context and efficiency-critical applications.

State Space Models (SSMs)

State space models, particularly the Mamba architecture (Gu and Dao, 2023), offer a fundamentally different approach to sequence modeling. Instead of the quadratic attention mechanism, SSMs process sequences through a linear recurrence that scales linearly with sequence length. This makes them dramatically more efficient for long sequences.

The key insight behind Mamba is the "selective state space" mechanism, which allows the model to selectively propagate or forget information based on the input content. This addresses the principal limitation of earlier SSMs, which processed all tokens identically regardless of content relevance.

As of early 2026, SSMs have demonstrated competitive performance with transformers on many benchmarks, with clear advantages in throughput for long sequences. However, they have not yet matched transformers on tasks that require complex in-context learning or precise information retrieval from context. Hybrid architectures that combine transformer attention layers with SSM layers are emerging as a pragmatic compromise, offering the efficiency of SSMs for most processing while retaining attention for tasks that require it.
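A stripped-down sketch of the recurrence helps fix the idea. The version below uses a scalar state and a hand-written gate for readability; real SSMs use state vectors per channel, learned parameters, and hardware-aware parallel scans:

```python
def ssm_scan(A, B, C, xs):
    """Linear state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    O(1) work per token, so O(n) over the sequence (vs O(n^2) attention)."""
    h, ys = 0.0, []
    for x in xs:
        h = A * h + B * x
        ys.append(C * h)
    return ys

def selective_ssm_scan(gate, B, C, xs):
    """Mamba-style selectivity, simplified: the retention factor depends on
    the current input, letting the model keep or discard state per token."""
    h, ys = 0.0, []
    for x in xs:
        h = gate(x) * h + B * x   # gate(x) in [0, 1]: 1 = retain, 0 = forget
        ys.append(C * h)
    return ys

# With a content-dependent gate, a "reset" token can wipe the state:
gate = lambda x: 0.0 if x < 0 else 0.9
ys = selective_ssm_scan(gate, 1.0, 1.0, [1.0, 2.0, -1.0, 3.0])
```

In the non-selective scan, every token decays the state identically; the selective variant is what lets the model ignore irrelevant content, which is the key difference the Mamba paper introduces.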

Mixture of Experts (MoE) at Scale

Mixture of experts is not a new idea, but its application at frontier scale has been transformative. Models like Mixtral (Jiang et al., 2024), DeepSeek-V2, and reportedly GPT-4 use sparse MoE architectures where only a fraction of the model's parameters are activated for each token. This decouples total model capacity (total parameters) from per-token compute (active parameters).

The scaling implications are significant. A 1.8-trillion-parameter MoE model that activates only 100 billion parameters per token has the knowledge capacity of a dense 1.8T model with the inference cost of a dense 100B model. This allows MoE models to scale capacity without proportionally scaling inference cost, although training cost and memory requirements still scale with total parameters.
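The routing mechanism behind this decoupling is conceptually simple. The sketch below shows top-k expert selection for a single token; it is a simplification, since production routers apply a softmax over learned logits and add load-balancing losses during training:

```python
def top_k_route(scores, k=2):
    """Pick the k highest-scoring experts for a token and renormalize their
    scores into mixing weights. Only those k experts' parameters run."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]

# 8 experts, top-2 routing: 2/8 of the expert parameters are active
# for this token, regardless of total model capacity.
weights = top_k_route([0.1, 0.05, 0.3, 0.02, 0.25, 0.08, 0.12, 0.08], k=2)
print(weights)
```

The token's output is then the weighted sum of the chosen experts' outputs, which is how total capacity and per-token compute come apart.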

Open questions for MoE scaling include: expert load balancing at extreme scale, the optimal number and granularity of experts, routing stability during training, and whether MoE models exhibit different emergence patterns than dense models.

Diffusion Language Models

A more speculative direction is applying diffusion processes (successful in image generation) to language modeling. Unlike autoregressive models that generate tokens left-to-right, diffusion language models generate all tokens simultaneously through iterative refinement, starting from noise and progressively denoising.

Early results (e.g., MDLM by Sahoo et al., 2024) are promising but still lag behind autoregressive models on standard benchmarks. The potential advantage is in tasks that benefit from bidirectional generation, such as infilling, editing, and constrained generation. If diffusion language models can close the quality gap, they may offer qualitatively different capabilities from autoregressive models.

5. The Chinchilla Trap and the Llama Strategy

An important subtlety in the scaling debate is the distinction between compute-optimal training and inference-optimal training. The Chinchilla scaling laws tell you how to allocate a fixed compute budget to minimize training loss. But in practice, models are trained once and served millions (or billions) of times. The total cost of ownership is dominated by inference cost, not training cost.

This creates what we might call the "Chinchilla trap": if you follow the compute-optimal recipe, you end up with a large model that is expensive to serve. The Llama strategy (and its successors) deliberately violates the Chinchilla prescription by training a smaller model on more data than is compute-optimal. The resulting model has slightly higher training loss than a Chinchilla-optimal model trained with the same total compute, but it is much cheaper to serve because it has fewer parameters.
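The trap can be quantified with the same rules of thumb used earlier: roughly 6·N·D FLOPs to train and 2·N·T FLOPs per query to serve. The numbers below are hypothetical, chosen only to show how serving volume flips the comparison:

```python
def lifetime_flops(n_params, train_tokens, queries, tokens_per_query):
    """Total compute over the model's life: one-time training (~6*N*D FLOPs)
    plus per-query inference (~2*N*T FLOPs)."""
    train = 6.0 * n_params * train_tokens
    serve = 2.0 * n_params * tokens_per_query * queries
    return train + serve

# Hypothetical: 10 billion served queries of ~1000 tokens each.
chinchilla = lifetime_flops(70e9, 1.4e12, 1e10, 1000)   # compute-optimal 70B
overtrained = lifetime_flops(35e9, 5.0e12, 1e10, 1000)  # half the size, more data

print(f"Chinchilla-optimal 70B: {chinchilla:.2e} lifetime FLOPs")
print(f"Over-trained 35B:       {overtrained:.2e} lifetime FLOPs")
```

The over-trained model pays more up front but less on every query; at high enough serving volume, total cost of ownership favors the smaller model, which is the whole point of the Llama strategy.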

The Chinchilla trap illustrates a broader lesson: optimizing for one metric (training compute efficiency) can be counterproductive when the real objective (total cost of ownership) involves a different metric. This lesson recurs throughout the scaling frontier: the optimal strategy depends critically on what you are optimizing for.

6. Multimodal Scaling

A growing body of evidence suggests that training on multiple modalities (text, images, video, audio, code) improves performance on each individual modality compared to training on that modality alone. This is not merely because more data is available; it is because different modalities provide complementary learning signals.

For example, visual grounding helps language models understand spatial relationships. Code training improves logical reasoning. Audio training with transcripts provides a richer model of natural language than text alone. The "scaling" in multimodal scaling is not just about more tokens; it is about richer information per token, which accelerates learning.

The practical challenge is tokenization and alignment. How do you represent a video frame, a spectrogram, and a paragraph of text in a shared token space that allows the model to learn cross-modal relationships? The approaches surveyed in Chapter 27 (native multimodal encoders, adapter-based fusion, interleaved pretraining) are all active areas of scaling research.

Note

Of the three scaling axes, many researchers argue that test-time compute scaling will have the largest practical impact over the next two to three years. Pretraining is hitting data and cost ceilings, and the improvements from data curation, while real, are incremental. Test-time compute, by contrast, is a new dimension with a steep improvement curve. The ability to "think longer" on hard problems, combined with verifiers and search, is producing the most dramatic capability gains seen since the original scaling breakthrough. The analogy to human cognition is instructive: people do not become smarter by growing more neurons, but by learning to think more carefully. It has been widely discussed that the most capable AI systems of the near future may be relatively modest in parameter count but sophisticated in their test-time reasoning strategies.

Exercises

Exercise 34.2.1: Data Budget Analysis (Analysis)

Suppose you are planning a pretraining run for a 70-billion-parameter model. The Chinchilla-optimal data allocation is approximately 1.4 trillion tokens (20x the parameter count). You have access to 500 billion tokens of high-quality curated data and 5 trillion tokens of lower-quality web-scraped data.

  1. What are your options for meeting the data requirement? List at least three strategies.
  2. For each strategy, identify the primary risk and the mitigation approach.
  3. If your primary goal is to minimize inference cost rather than training cost, how does this change your strategy?
Show Answer

1. Strategies: (a) Use all 500B high-quality tokens plus 900B of the best web-scraped tokens (filtered aggressively). (b) Train for 3 epochs on the 500B high-quality tokens, reaching 1.5T tokens with repetition. (c) Use synthetic data generation to expand the 500B tokens into 1.4T tokens. (d) Adopt the Llama strategy: train a smaller model (e.g., 30B) for longer on the available data, accepting the Chinchilla violation.

2. Risks and mitigations: (a) Quality degradation from web-scraped data; mitigate with perplexity filtering and quality classifiers. (b) Diminishing returns from repeated data; mitigate with careful learning rate scheduling and data ordering. (c) Model collapse from synthetic data; mitigate with diversity metrics and mixing synthetic with human data at a controlled ratio. (d) Suboptimal training loss; acceptable if inference cost is the priority.

3. If minimizing inference cost is the goal, strategy (d) becomes strongly preferred. A smaller model trained on more data (even repeated or lower-quality data) will be cheaper to serve. The slight loss in training efficiency is more than compensated by the reduced per-query inference cost at scale.

Exercise 34.2.2: Architecture Trade-offs (Discussion)

You are the CTO of a company building a document processing pipeline that handles contracts averaging 50,000 tokens. Your current system uses a transformer-based model with a 32K context window, requiring document chunking and retrieval. A vendor offers an SSM-based model with a 256K effective context window at similar per-token cost.

  1. What are the potential advantages of switching to the SSM model for your use case?
  2. What risks should you evaluate before committing to the switch?
  3. How would you design an evaluation to compare the two approaches on your specific workload?
Show Answer

1. Advantages: eliminates the chunking/retrieval pipeline (reducing engineering complexity and potential retrieval errors); the model can attend to the entire document simultaneously, potentially improving consistency and cross-reference accuracy; long-range dependencies (e.g., a clause on page 1 modifying a clause on page 40) can be captured directly.

2. Risks: (a) SSMs may underperform transformers on precise information retrieval from long contexts (the "needle in a haystack" problem). (b) The 256K context claim may not hold for all task types; effective context can be shorter than theoretical context. (c) Vendor lock-in on a less mature architecture. (d) The SSM's performance on your specific domain (legal text) may differ from general benchmarks.

3. Evaluation design: (a) Create a test suite of real contracts with ground-truth annotations. (b) Test both models on the same tasks: extraction, summarization, cross-reference resolution, and question answering. (c) Specifically test long-range dependencies: questions that require information from distant parts of the document. (d) Measure not just accuracy but latency, cost, and consistency across runs. (e) Include adversarial tests: documents with contradictory clauses, unusual formatting, and embedded tables.

Tip: Watch for State Space Model Deployments

Models like Mamba offer $O(n)$ inference instead of $O(n^{2})$ for transformers, making them promising for very long sequences. If your application processes documents over 100K tokens, benchmark an SSM variant alongside your transformer baseline.

Key Takeaways

The stock of high-quality public text (roughly 4 to 15 trillion tokens) cannot support another order-of-magnitude increase in Chinchilla-optimal model size.

Synthetic data works well in verifiable domains, but recursive training risks model collapse; treat it as a supplement to human data, with diversity safeguards.

Scaling now proceeds along three axes: pretraining compute, data quality, and test-time compute. Only test-time compute is a per-query rather than one-time cost.

Compute-optimal training is not inference-optimal: over-training a smaller model often minimizes total cost of ownership.

State space models, sparse mixture of experts, and diffusion language models are the leading architectural challengers to the dense transformer, especially for long contexts and inference efficiency.

What Comes Next

In the next section, Section 34.3: Alignment Research Frontiers, we explore the open problems in aligning AI systems with human values, including scalable oversight, weak-to-strong generalization, and reward hacking.

References & Further Reading
Scaling Laws & Compute Optimization

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022.

The Chinchilla paper that redefined compute-optimal training by showing that data and parameters should scale equally. This directly motivates the data wall problem discussed in this section.

📄 Paper

Muennighoff, N., Rush, A., Barak, B., et al. (2023). "Scaling Data-Constrained Language Models." NeurIPS 2023.

Systematically studies what happens when training data is limited, showing that data repetition and augmentation can partially compensate. Essential reading for understanding strategies to push past the data wall.

📄 Paper
The Data Wall

Villalobos, P., Ho, A., Cerina, F., & Sevilla, J. (2024). "Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data." Epoch AI.

Quantifies the timeline for exhausting publicly available training data, projecting when the data wall becomes binding. The central empirical reference for the data scarcity arguments in this section.

📄 Paper

Shumailov, I., Shumaylov, Z., Zhao, Y., et al. (2024). "AI Models Collapse When Trained on Recursively Generated Data." Nature, 631, 755-759.

Demonstrates that training on synthetic data from previous model generations leads to model collapse, where output diversity degrades irreversibly. A key cautionary result for synthetic data strategies.

📄 Paper

Penedo, G., Malartic, Q., Hesslow, D., et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." Hugging Face Technical Report.

Documents the curation of a high-quality 15-trillion-token web dataset, illustrating the data quality engineering that extends the useful life of web-sourced training data.

📄 Paper
Architecture Innovations

Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.

Showed that smaller, well-trained models can match much larger ones, catalyzing the open-source LLM ecosystem. Exemplifies the "train longer on more data" approach to scaling efficiency.

📄 Paper

Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.

Introduces a selective state space model that achieves transformer-level quality with linear-time inference. Represents the leading alternative architecture for addressing the quadratic cost of attention.

📄 Paper

Jiang, A. Q., Sablayrolles, A., Roux, A., et al. (2024). "Mixtral of Experts." arXiv:2401.04088.

Demonstrates that mixture-of-experts architectures can deliver frontier performance at a fraction of the inference cost. Directly relevant to the section's discussion of compute economics.

📄 Paper
Emerging Paradigms

Sahoo, S., Arriola, M., Schiff, Y., et al. (2024). "Simple and Effective Masked Diffusion Language Models." NeurIPS 2024.

Presents a diffusion-based approach to language modeling that generates text in parallel rather than autoregressively. Illustrates the new generation paradigms discussed as alternatives to traditional scaling.

📄 Paper

Xu, C., Sun, Q., Zheng, K., et al. (2023). "WizardLM: Empowering Large Language Models to Follow Complex Instructions." arXiv:2304.12244.

Shows how synthetic instruction data generated through "Evol-Instruct" can improve model capability without more pretraining data. Exemplifies the synthetic data strategies discussed as a response to the data wall.

📄 Paper