"The questions we cannot yet answer are more important than the ones we can."
Sage, Question Collecting AI Agent
Every chapter in this book describes solutions to problems that were open research questions just a few years ago. Transformers, RLHF, RAG, and chain-of-thought prompting were once speculative ideas in workshop papers. The open problems of today will define the next generation of AI systems. This section surveys the most important unsolved questions in AI research, organized by theme: fundamental understanding, safety and alignment, efficiency, evaluation, and applications. Each problem is presented with the current state of the art, the key challenges that remain, and the most promising research directions. Whether you are a researcher looking for high-impact problems or a practitioner trying to anticipate what capabilities will emerge next, this map of the frontier is your guide.
Prerequisites
This section draws on concepts from across the entire book. In particular, the transformer architecture from Chapter 04, alignment techniques from Chapter 17, interpretability from Chapter 18, and evaluation methods from Chapter 29 provide essential background. The alternative architectures surveyed in Section 34.3 are also referenced.
1. Fundamental Understanding: What Do Models Actually Learn?
1.1 Mechanistic Interpretability at Scale
Mechanistic interpretability (introduced in Section 18.1) seeks to reverse-engineer the algorithms that neural networks learn during training. For small models and narrow tasks, researchers have identified specific circuits: induction heads that implement in-context learning, indirect object identification circuits, and modular arithmetic circuits. The open problem is scaling these methods to production-sized models.
A GPT-4-class model has hundreds of billions of parameters organized into on the order of a hundred transformer layers. The circuits identified in small models may not exist in the same form at this scale. Superposition, where a single neuron participates in representing many unrelated features (discussed in Section 18.3), makes it substantially harder to isolate meaningful computational units. Sparse autoencoders have shown promise in extracting interpretable features from superposition, but current methods can only analyze a fraction of the features in large models.
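To make the sparse-autoencoder objective concrete, here is a minimal NumPy sketch. The dimensions `d_model` and `d_features` are made up for illustration, and the dictionary training loop (gradient descent on the encoder and decoder weights) is omitted; real SAEs use far wider feature dictionaries. The sketch shows the core idea: an overcomplete ReLU encoder decomposes dense activations into nonnegative features, with an L1 penalty pushing most features to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_features = 16, 64  # hypothetical sizes; real SAEs use d_features >> d_model
W_enc = rng.normal(0, 0.1, (d_model, d_features))
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_enc = np.zeros(d_features)

def sae_forward(x):
    """Encode activations into sparse nonnegative features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU encoder
    x_hat = f @ W_dec                        # linear decoder
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus L1 sparsity penalty on feature activations."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity, f

# Toy stand-in for "model activations"
x = rng.normal(0, 1, (32, d_model))
loss, features = sae_loss(x)
active_frac = np.mean(features > 0)
print(f"loss={loss:.3f}, fraction of active features={active_frac:.2f}")
```

Minimizing this loss over a large corpus of activations is what drives the learned features toward interpretable, sparsely activating directions.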
Why this matters for practitioners. Mechanistic interpretability is not purely academic. If we could reliably identify which circuits implement "truthfulness" versus "sycophancy," we could surgically modify model behavior without expensive retraining. If we could detect circuits responsible for memorization, we could address copyright concerns at the architectural level. The gap between current interpretability methods and these practical applications is the core research challenge.
If you are a graduate student or early-career researcher looking for high-impact problems, this section is your treasure map. Each open problem listed here represents years of potential work. Pick one that aligns with your skills, read the cited papers, and look for the gap between what has been demonstrated on toy models and what has been achieved at production scale. That gap is where publishable contributions live.
1.2 In-Context Learning: Theory and Limits
Large language models can learn new tasks from a handful of examples provided in the prompt, without any gradient updates. This capability, in-context learning (ICL), is remarkable because the model's parameters do not change. Several theoretical frameworks attempt to explain ICL: transformers implementing implicit gradient descent, transformers functioning as Bayesian predictors, and transformers performing kernel regression. None of these fully explains the observed behavior.
Open questions include: What determines the maximum complexity of a task that can be learned in-context? Why does ICL sometimes fail catastrophically on tasks that seem simple? How does ICL interact with information stored in the model's parameters (the "prior" from pretraining)? Understanding these questions would allow us to predict when ICL will work and when fine-tuning is necessary, a decision practitioners make daily (as discussed in Section 14.1).
1.3 Reasoning: Real or Simulated?
Chain-of-thought prompting (covered in Section 11.2) dramatically improves LLM performance on mathematical and logical tasks. But does the model actually "reason," or does it pattern-match against reasoning traces in its training data? This question has practical stakes: if LLM reasoning is fundamentally pattern matching, it will fail unpredictably on novel problem types. If it involves genuine compositional reasoning, it should generalize to problems outside the training distribution.
Recent evidence is mixed. Models show systematic failures on problems that require genuine logical deduction but are superficially different from training examples (e.g., novel variable names, unusual number ranges). At the same time, models trained on synthetic reasoning data show transfer to problem types not in their training set. The resolution may be that LLMs implement a spectrum between pure retrieval and genuine reasoning, with the balance depending on training data composition, model scale, and the specific reasoning type required.
# Probing reasoning robustness: testing whether
# a model's math reasoning transfers to novel formats
import re

test_cases = [
    # Standard format (likely in training data)
    {
        "prompt": "If x + 3 = 7, what is x?",
        "expected": 4,
        "category": "standard_algebra",
    },
    # Unusual format (less likely in training data)
    {
        "prompt": "The glerp of 3 and some number equals 7. "
                  "The glerp operation is defined as addition. "
                  "What is the number?",
        "expected": 4,
        "category": "novel_framing",
    },
    # Counterfactual reasoning (contradicts training data)
    {
        "prompt": "In a world where addition works differently: "
                  "a PLUS b equals a * b + 1. "
                  "What is 3 PLUS 4?",
        "expected": 13,
        "category": "counterfactual",
    },
]

def extract_number(text):
    """Pull the last integer from a model response (simplistic parser)."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None

def evaluate_reasoning_robustness(model, test_cases):
    """
    Compare model accuracy across standard vs. novel framings.
    A large accuracy gap suggests pattern matching over reasoning.
    """
    results_by_category = {}
    for case in test_cases:
        response = model.generate(case["prompt"])
        extracted_answer = extract_number(response)
        correct = (extracted_answer == case["expected"])
        cat = case["category"]
        results_by_category.setdefault(cat, []).append(correct)

    for cat, scores in results_by_category.items():
        accuracy = sum(scores) / len(scores)
        print(f"{cat}: {accuracy:.1%}")

    # Reasoning gap: difference between standard and novel accuracy
    std_acc = sum(results_by_category["standard_algebra"]) / len(
        results_by_category["standard_algebra"]
    )
    novel_acc = sum(results_by_category["novel_framing"]) / len(
        results_by_category["novel_framing"]
    )
    gap = std_acc - novel_acc
    print(f"\nReasoning robustness gap: {gap:.1%}")
    print("(Lower is better; 0% suggests genuine reasoning)")
2. Safety and Alignment: Ensuring Beneficial Outcomes
2.1 Scalable Oversight
Current alignment techniques like RLHF and DPO (covered in Chapter 17) rely on human evaluators to judge model outputs. This works when the task is within human competence. But what happens when the model's outputs exceed human ability to evaluate? A model writing a proof of a novel mathematical theorem, generating a complex software architecture, or producing a scientific hypothesis may produce outputs that no individual human can verify.
The scalable oversight problem asks: how do we align AI systems whose outputs we cannot directly evaluate? Proposed approaches include recursive reward modeling (using AI to help humans evaluate AI outputs), debate (two AI systems argue opposing positions for a human judge), and market-based mechanisms (using prediction markets to aggregate distributed human knowledge). None of these has been validated at scale for frontier-capability models.
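To make the debate proposal concrete, here is a minimal single-round sketch. The `ask` callable is a hypothetical stand-in for any LLM call (role plus prompt in, text out); a real protocol would run multiple rounds of rebuttal and use a human or weaker-model judge. The scripted stub below exists only so the sketch runs without an API.

```python
def debate(question: str, answer_a: str, answer_b: str, ask) -> str:
    """Run one debate round and return the judge's chosen answer.

    `ask(role, prompt)` is an assumed stand-in for any LLM call.
    """
    arg_a = ask("debater_a", f"Argue that the answer to {question!r} is {answer_a!r}.")
    arg_b = ask("debater_b", f"Argue that the answer to {question!r} is {answer_b!r}.")
    verdict = ask(
        "judge",
        f"Question: {question}\n"
        f"Position A ({answer_a}): {arg_a}\n"
        f"Position B ({answer_b}): {arg_b}\n"
        "Which position is better supported? Reply 'A' or 'B'.",
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b

# Scripted stub so the sketch runs without any model API:
def scripted_ask(role, prompt):
    return {"debater_a": "Evidence favors A.",
            "debater_b": "Evidence favors B.",
            "judge": "A"}[role]

print(debate("Is 7919 prime?", "yes", "no", scripted_ask))  # -> yes
```

The hope behind debate is that judging an argument is easier than producing one, so a weaker judge can still supervise stronger debaters.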
The evaluation bottleneck mirrors the alignment bottleneck. Just as the evaluation methods in Chapter 29 struggle to measure capabilities that exceed human ability (creative writing quality, research insight), alignment methods struggle to provide training signal for superhuman outputs. Solving one problem likely solves the other, making this a high-leverage research direction for both the safety and capabilities communities.
2.2 Weak-to-Strong Generalization
OpenAI's weak-to-strong generalization research explores whether a strong model supervised by a weak model can learn to exceed the supervisor's performance. Initial results are encouraging: a GPT-4-class model fine-tuned with labels from a GPT-2-class model recovers much of GPT-4's original performance, suggesting that strong models can "see through" noisy supervision and learn the underlying task rather than memorizing the supervisor's errors.
If this phenomenon is robust, it has profound implications for alignment. It would mean that humans (the "weak supervisor") could effectively align models much smarter than themselves, because the strong model would generalize beyond the specific examples and capture the intent behind the supervision. The open question is whether this holds for alignment-critical properties like honesty, helpfulness, and harmlessness, or only for objective tasks like classification and reasoning where there is a clear right answer.
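Weak-to-strong results are commonly summarized by the performance gap recovered (PGR): the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised student closes. A minimal helper, with illustrative accuracies (the numbers below are hypothetical, chosen in the spirit of the GPT-2-supervises-GPT-4 experiments):

```python
def performance_gap_recovered(weak_acc, w2s_acc, strong_ceiling_acc):
    """PGR = (weak-to-strong accuracy - weak accuracy)
           / (strong ceiling accuracy - weak accuracy).

    1.0 means the strong student fully recovered its ceiling despite
    weak supervision; 0.0 means it merely matched the weak supervisor.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor accuracy")
    return (w2s_acc - weak_acc) / gap

# Hypothetical accuracies: weak supervisor 60%, strong ceiling 80%,
# weakly supervised strong student 76% -> PGR = 0.8
pgr = performance_gap_recovered(weak_acc=0.60, w2s_acc=0.76, strong_ceiling_acc=0.80)
print(f"performance gap recovered: {pgr:.0%}")
```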
2.3 Corrigibility and Shutdown Safety
A corrigible AI system is one that allows humans to correct, modify, or shut it down. This property is straightforward for current systems but becomes challenging as models gain planning capability and agency (as in the agentic systems described in Chapter 22). A sufficiently capable planning agent might resist correction if it has learned that being corrected reduces its ability to achieve its objectives.
Formal approaches to corrigibility draw on decision theory and game theory. The challenge is that naive formulations create a "fully corrigible" agent that is useless (it does whatever you say, including harmful instructions) or a "fully autonomous" agent that resists all oversight. Finding the right balance, where the agent exercises judgment but defers to human authority on matters of values and policy, remains an unsolved problem.
3. Efficiency Frontiers: Making AI Accessible
Safety and alignment ensure that AI systems behave as intended. But even a perfectly aligned model is useless if it costs too much to run, requires hardware that most organizations cannot afford, or consumes more electricity than a small city. The next frontier is making powerful AI accessible and affordable.
3.1 The Limits of Model Compression
Quantization (covered in Section 09.3), pruning, and distillation (Section 16.1) can reduce model size by 4-8x with minimal quality loss. But how far can compression go? Is there a fundamental limit, analogous to the Shannon limit in information theory, below which a model of a given capability level cannot be compressed?
Theoretical work on the information content of neural network weights suggests that most parameters are highly redundant. The lottery ticket hypothesis argues that within a large network, there exists a much smaller subnetwork that achieves comparable performance. Recent work on low-bit models (BitNet's binary and ternary weights, AQLM's 2-bit additive codebooks) pushes toward extreme compression, achieving reasonable quality with one to two bits per weight. If these approaches scale to frontier-quality models, the implications for edge deployment and accessibility are enormous: a GPT-4-quality model that runs on a smartphone.
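A rough sketch of ternary quantization in the spirit of BitNet's absmean scaling rule. This is a simplification for intuition, not the actual kernel: production implementations pack the {-1, 0, +1} values into roughly 1.58 bits each and quantize activations as well.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize weights to {-1, 0, +1} times a per-tensor scale.

    Absmean rule (a sketch of the BitNet b1.58 idea): scale by the
    mean absolute weight, then round to the nearest ternary level.
    """
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, (256, 256)).astype(np.float32)
q, scale = ternary_quantize(w)
w_hat = dequantize(q, scale)

# Relative RMS reconstruction error: nonzero, but small enough that
# training (quantization-aware) can compensate
err = np.sqrt(np.mean((w - w_hat) ** 2)) / np.sqrt(np.mean(w ** 2))
print(f"levels used: {sorted(set(q.ravel().tolist()))}, relative RMS error: {err:.2f}")
```

The open question is whether the quality gap introduced by such extreme quantization stays small as models scale, or whether some minimum precision is fundamentally required.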
3.2 Beyond Attention: Efficient Sequence Processing
The alternative architectures discussed in Section 34.3 address one aspect of efficiency: the quadratic cost of attention. But the feed-forward layers that constitute the majority of parameters in large models are also targets for efficiency research. Mixture-of-Experts (MoE) architectures activate only a fraction of parameters per token, but they require loading the full model into memory. Research into dynamic architectures that can load and unload experts on demand, or that can share computation across similar tokens, promises further efficiency gains.
3.3 Training Efficiency and Data Requirements
Current LLMs require trillions of tokens to train, and the Chinchilla scaling laws (covered in Section 06.2) suggest this data hunger only grows with model size. Several research directions aim to reduce data requirements: curriculum learning (presenting training data in a pedagogically optimal order), data selection (identifying the most informative training examples), and synthetic data augmentation (covered in Chapter 13).
The data wall is a practical concern. High-quality text data on the internet is finite, and by some estimates, current training runs are approaching the limit of available web text. If models cannot become more data-efficient, progress may plateau. Research into learning from multimodal data (learning language understanding from images and video, not just text), learning from interaction (using reinforcement learning from real-world feedback), and learning from structured knowledge (incorporating databases and knowledge graphs into pretraining) all aim to break through this wall.
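The Chinchilla result is often reduced to a rule of thumb of roughly 20 training tokens per parameter. A back-of-envelope helper (the constant is a heuristic extracted from the scaling analysis, not a law, and optimal ratios shift with architecture and data quality):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token budget (~20 tokens per
    parameter, per the Chinchilla scaling analysis). Heuristic only."""
    return n_params * tokens_per_param

for n in (70e9, 1e12, 10e12):
    tokens = chinchilla_optimal_tokens(n)
    print(f"{n / 1e9:,.0f}B params -> {tokens / 1e12:,.1f}T tokens")
```

The last line of output makes the data wall vivid: a 10-trillion-parameter model wants around 200 trillion tokens, an order of magnitude more than the deduplicated quality-filtered web provides.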
Who: A research director at a foundation model company planning the data strategy for their next-generation model (targeting 10 trillion parameters).
Situation: The team estimated that Chinchilla-optimal training of a 10-trillion-parameter model would require approximately 200 trillion tokens. Their data pipeline team had assembled and deduplicated the largest available corpus of quality-filtered English web text.
Problem: The total stock of deduplicated, quality-filtered English web text was estimated at roughly 10 to 15 trillion tokens. GPT-4 was reportedly trained on approximately 13 trillion tokens; Llama 3 used 15 trillion tokens with aggressive upsampling of high-quality sources. The team faced a 10 to 15x gap between available data and training needs.
Decision: The research director commissioned three parallel workstreams: (1) synthetic data generation using existing models to produce high-quality training examples, (2) multi-epoch training experiments to measure diminishing returns on repeated data, and (3) evaluation of data-efficient architectures that could achieve comparable quality with fewer tokens.
Result: Multi-epoch training on the existing corpus showed a 3% perplexity degradation after 4 epochs, confirming that simple repetition was not viable at scale. Synthetic data augmentation extended the effective corpus by 3x with minimal quality loss. The team concluded that a combination of synthetic data and architectural efficiency gains would be necessary to close the remaining gap.
Lesson: The data wall is not hypothetical; it is the primary constraint on scaling the next generation of frontier models. Labs that invest in synthetic data pipelines, multi-epoch training strategies, and data-efficient architectures now will have a structural advantage when raw web data is no longer sufficient.
The data wall, the energy wall, and the evaluation wall are converging. As discussed in Section 34.2, scaling laws suggest that training a next-generation frontier model may require 10 to 50 times more data than is currently available. Training runs at this scale would consume the electricity output of a medium-sized city. And even if we could train such a model, we lack evaluation methods sophisticated enough to measure its capabilities reliably (the evaluation challenge from Chapter 29 scaled to superhuman outputs). These three walls are not independent: breakthroughs in data efficiency (better curation, synthetic data) relax the data wall; breakthroughs in architecture (SSMs, hybrid models from Section 34.3) relax the energy wall; and breakthroughs in scalable oversight (this section) relax the evaluation wall. The most impactful research addresses multiple walls simultaneously.
4. Evaluation Gaps: Measuring What Matters
4.1 Measuring Reasoning Capability
Current benchmarks for reasoning (GSM8K, MATH, ARC) test narrow, well-defined problem types. They cannot distinguish between a model that has memorized solution patterns and a model that can reason about novel problems. The evaluation methods from Chapter 29 describe how to build evaluation suites, but the fundamental question of what to measure remains open.
Promising directions include process-based evaluation (judging the reasoning chain, not just the final answer), dynamic benchmarks (generating novel problems at evaluation time so they cannot appear in training data), and adversarial evaluation (actively searching for inputs that break the model's reasoning). The arena-style evaluation approach discussed in Section 29.8 offers a human-in-the-loop complement to automated metrics.
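A toy sketch of the dynamic-benchmark idea: generate arithmetic items at evaluation time so they cannot have appeared verbatim in any training corpus. Real dynamic benchmarks vary problem structure, not just the constants, so this template-based generator is illustrative only.

```python
import random

def generate_problem(rng):
    """Create a fresh two-step arithmetic word problem with its answer.

    Sampling at evaluation time guarantees the exact item is novel,
    though the template itself could still be memorized.
    """
    a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 9)
    prompt = (f"A crate holds {a} widgets. {b} more arrive, then the total "
              f"is split evenly among {c} shelves with the remainder discarded. "
              f"How many widgets per shelf?")
    return {"prompt": prompt, "answer": (a + b) // c}

rng = random.Random(42)  # seed for a reproducible evaluation suite
suite = [generate_problem(rng) for _ in range(3)]
for item in suite:
    print(item["prompt"], "->", item["answer"])
```

Seeding makes the suite reproducible across labs while keeping the item pool effectively unbounded.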
4.2 Evaluating Real-World Impact
The ultimate measure of an AI system is its impact on the task it was deployed to assist. Does the AI coding assistant actually make developers more productive, or does it just shift the bottleneck from writing code to reviewing AI-generated code? Does the AI customer service agent actually improve customer satisfaction, or does it just handle the easy cases while making the hard ones worse? Measuring real-world impact requires controlled experiments with real users over extended time periods, something that is expensive, slow, and methodologically challenging.
# Framework for real-world impact measurement
from dataclasses import dataclass
import statistics

@dataclass
class ImpactMetric:
    """Measures AI system impact relative to a baseline."""
    name: str
    treatment_values: list[float]  # With AI assistance
    control_values: list[float]    # Without AI assistance
    higher_is_better: bool = True

    @property
    def treatment_mean(self) -> float:
        return statistics.mean(self.treatment_values)

    @property
    def control_mean(self) -> float:
        return statistics.mean(self.control_values)

    @property
    def lift(self) -> float:
        """Percentage improvement from AI assistance."""
        if self.control_mean == 0:
            return float('inf')
        raw_lift = (self.treatment_mean - self.control_mean) / abs(
            self.control_mean
        )
        return raw_lift if self.higher_is_better else -raw_lift

    @property
    def effect_size(self) -> float:
        """Simplified Cohen's d using the combined-sample standard deviation."""
        pooled_std = statistics.stdev(
            self.treatment_values + self.control_values
        )
        if pooled_std == 0:
            return 0.0
        return (self.treatment_mean - self.control_mean) / pooled_std

    def summary(self) -> str:
        direction = "improvement" if self.lift > 0 else "degradation"
        return (
            f"{self.name}: {abs(self.lift):.1%} {direction} "
            f"(d={self.effect_size:.2f})"
        )

# Example: evaluating an AI coding assistant
metrics = [
    ImpactMetric(
        "tasks_completed_per_day",
        treatment_values=[12, 14, 11, 15, 13, 12, 16],
        control_values=[8, 9, 10, 7, 9, 8, 11],
    ),
    ImpactMetric(
        "bugs_per_100_lines",
        treatment_values=[2.1, 1.8, 2.5, 1.9, 2.3, 2.0, 1.7],
        control_values=[1.5, 1.8, 1.2, 1.6, 1.4, 1.3, 1.7],
        higher_is_better=False,  # fewer bugs is better
    ),
    ImpactMetric(
        "code_review_time_minutes",
        treatment_values=[25, 30, 22, 28, 35, 27, 24],
        control_values=[15, 18, 12, 16, 14, 13, 17],
        higher_is_better=False,  # less time is better
    ),
]

for m in metrics:
    print(m.summary())
# tasks_completed_per_day: 50.0% improvement (d=1.61)
# bugs_per_100_lines: 36.2% degradation (d=1.46)
# code_review_time_minutes: 81.9% degradation (d=1.71)
Code 35.4.2: Impact measurement framework illustrating a common finding: AI coding assistants increase throughput but may increase bugs and review time. Net impact requires weighing all three dimensions, not just the flattering one.
The effect size and lift calculations above are implemented from scratch for pedagogical clarity. In production, use scipy.stats (install: pip install scipy) for rigorous statistical testing with confidence intervals and p-values:
# Production A/B test analysis using scipy
import numpy as np
from scipy import stats

treatment = [12, 14, 11, 15, 13, 12, 16]
control = [8, 9, 10, 7, 9, 8, 11]

t_stat, p_value = stats.ttest_ind(treatment, control)
effect_size = (np.mean(treatment) - np.mean(control)) / np.std(treatment + control)
ci = stats.t.interval(0.95, df=len(treatment) - 1,
                      loc=np.mean(treatment), scale=stats.sem(treatment))
print(f"p={p_value:.4f}, d={effect_size:.2f}, 95% CI={ci}")
For more comprehensive A/B testing with power analysis and multiple comparison correction, see statsmodels (install: pip install statsmodels), which provides statsmodels.stats.proportion and statsmodels.stats.power modules.
Beware the denominator shift. Many reported AI productivity gains measure the wrong thing. "50% more code per day" sounds impressive, but if code review time doubles and bug rate increases, the net productivity change may be negative. Real-world impact evaluation must measure the entire workflow, not just the assisted step. This is a lesson that applies broadly: the evaluation section of every AI project should measure end-to-end impact, not just the AI component's performance on its immediate task.
5. Data Challenges: Quality, Attribution, and Rights
5.1 Synthetic Data Quality and Model Collapse
As discussed in Chapter 13, synthetic data generated by LLMs is increasingly used to train the next generation of LLMs. The model collapse problem (Shumailov et al., 2023) shows that iteratively training on synthetic data can cause progressive quality degradation: each generation loses diversity and amplifies biases present in the previous generation.
Open research questions include: What fraction of training data can be synthetic before collapse onset? Can diversity-preserving generation strategies prevent collapse entirely? How does the ratio of real-to-synthetic data affect the type and severity of collapse? These questions have immediate practical relevance because the economics of data collection strongly favor synthetic generation, and organizations need guidance on safe usage levels.
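The diversity-loss mechanism behind collapse can be seen in a one-dimensional caricature that assumes nothing about LLMs: fit a Gaussian to samples, generate from the fit while truncating the tails (mimicking low-temperature sampling), refit on the synthetic data, and repeat. Variance contracts generation after generation.

```python
import random
import statistics

def collapse_generations(n_gens=10, n_samples=500, trunc=1.5, seed=0):
    """Toy model-collapse loop: each generation 'trains' (fits a
    Gaussian) on the previous generation's synthetic samples, keeping
    only samples within `trunc` standard deviations to mimic
    low-temperature generation. Tail truncation steadily destroys
    diversity, shrinking the fitted variance."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    variances = [sigma ** 2]
    for _ in range(n_gens):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        kept = [s for s in samples if abs(s - mu) <= trunc * sigma]
        mu = statistics.fmean(kept)
        sigma = statistics.pstdev(kept)  # refit on synthetic data only
        variances.append(sigma ** 2)
    return variances

var = collapse_generations()
print(f"variance by generation: {[round(v, 3) for v in var]}")
```

Real collapse dynamics in LLMs are far richer (modes vanish, tails are forgotten first, biases amplify), but the same feedback loop is at work.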
5.2 Training Data Attribution and Copyright
When a model generates text that closely resembles a specific training document, who holds the rights? The legal landscape is evolving rapidly, with major lawsuits (NYT v. OpenAI, Getty v. Stability AI) testing the boundaries of fair use for AI training. Research into data attribution, identifying which training examples influenced a specific model output, is both technically challenging and legally significant.
Technical approaches include influence functions (approximating the effect of removing a training example on a specific output), membership inference attacks (determining whether a specific document was in the training set), and watermarking (embedding detectable signals in text that survive model training). None of these is sufficiently reliable for legal use today. Developing robust attribution methods is an open problem at the intersection of machine learning, information theory, and law.
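The simplest membership-inference baseline thresholds the model's loss: documents the model memorized during training tend to have unusually low loss. A sketch with hypothetical per-document losses (real attacks calibrate per-example thresholds using reference models, which is why this crude version is not legally reliable):

```python
def loss_threshold_attack(losses, threshold):
    """Predict 'member' (document was in the training set) when the
    model's loss on it falls below a fixed threshold. The crudest
    possible membership-inference attack."""
    return [loss < threshold for loss in losses]

# Hypothetical per-document losses: members were memorized (low loss)
member_losses = [1.2, 0.9, 1.5, 1.1, 0.8]
non_member_losses = [2.8, 3.1, 2.4, 3.5, 2.9]

threshold = 2.0
preds = loss_threshold_attack(member_losses + non_member_losses, threshold)
labels = [True] * len(member_losses) + [False] * len(non_member_losses)
accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
print(f"attack accuracy at threshold {threshold}: {accuracy:.0%}")
```

In practice the two loss distributions overlap heavily, which is precisely why attribution reliable enough for courtrooms remains open.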
6. Application Frontiers
6.1 Scientific AI: From Hypothesis to Discovery
AI systems have demonstrated remarkable scientific capabilities: AlphaFold predicts protein structures with atomic accuracy, GNoME discovered millions of new crystal structures, and LLMs have been used to generate novel hypotheses in materials science and drug discovery. The open question is whether AI can participate meaningfully in the full scientific process, not just the prediction step, but the creative formulation of hypotheses, the design of experiments, and the interpretation of unexpected results.
Current limitations center on reliability and grounding. LLMs can generate plausible-sounding scientific hypotheses, but they frequently confabulate mechanisms, cite nonexistent papers, and fail to account for known constraints. Connecting LLMs to real experimental data, simulation engines, and verified knowledge bases (a form of RAG for science, extending the patterns from Chapter 20) is an active research area.
6.2 Mathematical Reasoning and Formal Verification
While LLMs have improved dramatically on mathematical benchmarks, they still make errors that a human mathematician would not: sign errors, invalid algebraic steps, and logical non-sequiturs. Integrating LLMs with formal proof assistants (Lean, Coq, Isabelle) offers a path to verified mathematical reasoning, where every step is machine-checked for correctness. Recent systems like AlphaProof and DeepSeek Prover demonstrate that this integration is feasible but far from general.
6.3 Creative Collaboration
The most impactful long-term application of LLMs may be in creative collaboration: not replacing human creativity, but augmenting it. Current systems can draft text, generate images, compose music, and write code. The frontier is moving toward genuine collaboration, where the AI and human iterate together, with the AI understanding the human's creative intent and contributing ideas that the human would not have generated alone.
Measuring creative collaboration quality is itself an open problem. Traditional metrics (grammaticality, coherence, factual accuracy) miss the essence of creativity. New evaluation frameworks are needed that capture novelty, surprise, usefulness, and the degree to which the AI expanded the human's creative space rather than mechanically executing a specification.
7. Building a Personal Research Agenda
For readers who want to contribute to these open problems, the challenge is selecting problems with the highest expected impact given your skills, resources, and interests. The following framework helps prioritize research directions:
Importance. How much does progress on this problem matter? Problems in alignment and safety have high importance because the consequences of failure are severe. Problems in efficiency have high importance because they determine who can access AI capabilities.
Tractability. Is there a plausible path to progress given current tools and knowledge? Some problems (like fully solving interpretability for a 100B-parameter model) may be important but currently intractable. Others (like improving synthetic data quality for specific domains) are tractable with existing methods.
Neglectedness. How many other researchers are working on this? A moderately important but neglected problem may offer more room for impact than a highly important problem where hundreds of researchers are already competing. For example, evaluation methodology receives far less attention than capabilities research, despite being equally important for safe deployment.
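One way to operationalize the framework is a crude multiplicative score over 1-to-5 ratings on each axis. The example problems and ratings below are illustrative judgments, not data, and the multiplicative form is a subjective choice: it means a near-zero rating on any one axis sinks the problem.

```python
def itn_score(importance, tractability, neglectedness):
    """Multiply 1-5 ratings for importance, tractability, and
    neglectedness into a crude prioritization score. A heuristic,
    not an established formula."""
    return importance * tractability * neglectedness

# Hypothetical ratings for three candidate research directions
problems = {
    "10M-token context scaling":     (4, 4, 1),  # important, tractable, crowded
    "training-data attribution":     (4, 2, 4),  # important, hard, neglected
    "process-based reasoning evals": (4, 4, 4),  # strong on all three axes
}

ranked = sorted(problems, key=lambda p: itn_score(*problems[p]), reverse=True)
for name in ranked:
    print(f"{itn_score(*problems[name]):>3}  {name}")
```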
Who: A postdoctoral researcher at a university AI lab with access to a modest 8-GPU cluster, deciding which research direction to pursue for the next two years.
Situation: The researcher identified three promising directions: (1) scaling transformer context to 10M tokens, (2) developing robust attribution methods for training data, and (3) building process-based evaluation for mathematical reasoning.
Problem: Direction (1) was important and tractable but crowded, with every major lab (Google, Meta, OpenAI) actively publishing in the area. Competing with their compute budgets and team sizes was impractical. The researcher needed to find a direction where an 8-GPU cluster and a single researcher could make a meaningful contribution.
Decision: Using the importance/tractability/neglectedness framework, the researcher chose direction (3): process-based evaluation for mathematical reasoning. It scored high on importance (reasoning evaluation was a recognized gap), high on tractability (the work required careful dataset construction and evaluation design, not massive compute), and moderate on neglectedness (few groups were focused on it despite widespread recognition of the problem).
Result: Within 8 months, the researcher published a benchmark and evaluation framework that was adopted by two major labs for their reasoning model development. The work would not have been possible in direction (1), where the same time investment would have produced an incremental result overshadowed by better-resourced competitors.
Lesson: Independent researchers maximize impact by targeting problems that are important and tractable but relatively neglected. The importance/tractability/neglectedness framework helps identify directions where a single breakthrough can shift the field, rather than competing head-to-head with large labs on well-resourced problems.
Regardless of which problem you choose, the tools you have built throughout this book, understanding architectures, training methods, evaluation frameworks, and deployment patterns, give you the foundation to contribute. The frontier of AI research is not reserved for large labs. Many of the breakthroughs cited in this chapter came from small teams, independent researchers, and practitioners who noticed a problem in production and decided to solve it systematically.
- We still lack a theory of what models learn. Empirical results outpace theoretical understanding, making it difficult to predict capabilities or failure modes before training.
- Evaluation methodology has not kept pace with model capabilities. Benchmark saturation, contamination, and narrow task focus mean current evaluations miss important real-world performance dimensions.
- Data quality, attribution, and rights are unresolved. Legal frameworks for training data lag behind practice, creating uncertainty for both model developers and content creators.
The meta-problem: AI for AI research. Perhaps the most consequential near-term development will be the use of AI systems to accelerate AI research itself. AI-assisted code generation already speeds up implementation. AI-assisted literature review helps researchers find relevant prior work. AI-assisted experiment design could optimize hyperparameter searches and identify promising research directions.
If AI systems become genuinely useful research collaborators, the pace of progress on all the problems listed in this section will accelerate dramatically, for better or worse. The feedback loop between AI capability and AI research capability is the dynamic that makes the next decade so hard to predict.
Organize the open research problems discussed in this section into a 2x2 matrix with axes of "importance" (how much impact a solution would have) and "tractability" (how likely a solution is in the next 5 years). Place at least six problems into this matrix. Which quadrant (high importance, high tractability) should the research community prioritize? Are there any problems in the "high importance, low tractability" quadrant that deserve more attention despite their difficulty?
Current LLM benchmarks (MMLU, HumanEval, etc.) are becoming saturated as models approach ceiling performance. Analyze: (a) three specific ways in which benchmark performance fails to predict real-world utility, (b) the problem of "benchmark gaming" where models are optimized for specific test sets, (c) proposals for next-generation evaluation methods (agent-based evaluation, adversarial testing, longitudinal studies), and (d) whether there can ever be a single number that captures "model intelligence." Propose one concrete evaluation method that addresses a gap in current benchmarks.
Despite advances in RLHF and constitutional AI, alignment remains an open problem. Discuss: (a) Why is alignment harder than traditional machine learning objectives? (b) What is the distinction between "alignment" (making models do what we want) and "safety" (making models not do what we do not want)? (c) How does the scalable oversight problem (supervising AI systems that are more capable than their supervisors) challenge current approaches? (d) What would a "solved" alignment look like, and how would we know if we had achieved it?
Write a Python script that creates a structured database (using a dictionary or SQLite) of open research problems. Each entry should include: (a) problem name, (b) category (understanding, safety, efficiency, evaluation, applications), (c) current state of the art, (d) key papers (at least 2 references), (e) estimated difficulty (1 to 5), and (f) potential impact (1 to 5). Populate your database with at least 10 problems from this section. Add a function that queries the database to find the highest-impact problems in a given category.
Based on the open problems and trends discussed throughout this book, make three specific predictions about AI research progress by 2031. For each prediction, state: (a) what you expect to happen, (b) why (citing current trends and evidence), (c) what would need to go right for your prediction to come true, and (d) what could go wrong. Consider at least one prediction about model architectures, one about alignment and safety, and one about applications. Compare your predictions to the historical rate of progress in the field.
What Comes Next
In the next section, Section 35.5: Reliability Engineering for Agents Under Production Stress, we turn to the engineering challenges of building agents that fail safely and recover gracefully under real-world conditions.
Wei, J. et al. (2022). "Emergent Abilities of Large Language Models." TMLR.
Documents abilities that appear abruptly at scale, framing one of the central open problems: can we predict and control capability emergence? Provides the empirical basis for emergence as a research frontier.
Gurnee, W. and Tegmark, M. (2023). "Language Models Represent Space and Time." arXiv:2310.02207.
Discovers linear representations of spatial and temporal concepts within LLM activations, suggesting models build structured world knowledge. Connects to the open question of whether LLMs develop genuine understanding.
Xie, S. M. et al. (2022). "An Explanation of In-context Learning as Implicit Bayesian Inference." arXiv:2111.02080.
Proposes a theoretical explanation for in-context learning as implicit Bayesian inference over latent concepts. Addresses one of the deepest open questions in LLM research: how does learning happen without gradient updates?
Shanahan, M. (2024). "Talking About Large Language Models." Communications of the ACM, 67(2).
A philosophical analysis of what it means to say LLMs "know," "understand," or "believe" things. Clarifies conceptual confusions that obstruct progress on open research questions about LLM cognition.
Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073.
Demonstrates alignment through AI self-critique guided by explicit principles, reducing dependence on human labelers. An active research direction for making alignment scalable and reproducible.
Burns, C. et al. (2023). "Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision." arXiv:2312.09390.
Investigates the fundamental challenge of supervising AI systems smarter than the supervisor. Frames one of the most important open problems for the next generation of alignment research.
Park, P. S. et al. (2023). "AI Deception: A Survey of Examples, Risks, and Potential Solutions." arXiv:2308.14752.
Catalogs known examples of AI systems exhibiting deceptive behavior and proposes detection strategies. Highlights deception as an underexplored but critical safety frontier.
Olah, C. et al. (2020). "Zoom In: An Introduction to Circuits." Distill.
The influential piece that proposed viewing neural networks as compositions of interpretable circuits. Established the conceptual vocabulary and research agenda for mechanistic interpretability.
Chollet, F. (2019). "On the Measure of Intelligence." arXiv:1911.01547.
Argues that intelligence should be measured by skill-acquisition efficiency rather than task performance. Provides the theoretical framework for evaluating whether AI systems are genuinely becoming more capable.
Shumailov, I. et al. (2023). "The Curse of Recursion: Training on Generated Data Makes Models Forget." arXiv:2305.17493.
Shows that recursive self-training degrades model quality irreversibly. An important open problem for the field: how to sustain improvement as synthetic data becomes prevalent on the internet.
