Part 4: Training and Adapting
Chapter 17: Alignment, RLHF, and DPO

Constitutional AI & Self-Alignment

Give a model a preference label, and it learns one judgment. Give it a constitution, and it learns to judge for itself.

Big Picture

Constitutional AI replaces thousands of human preference labels with a small set of written principles. Instead of hiring annotators to judge every response pair, CAI asks the model itself to critique and revise its outputs according to a "constitution" of behavioral rules. The model generates a response, critiques it against the constitution, revises it, and the revised outputs serve as training data. This approach (developed by Anthropic) dramatically reduces the cost of alignment data collection and allows alignment behavior to be specified declaratively through principles rather than implicitly through examples. As we saw in Section 17.1, the human annotation required by RLHF is the bottleneck that CAI aims to eliminate. The synthetic data generation techniques from Chapter 13 offer a related but distinct approach to reducing annotation costs.

Prerequisites

This section builds on RLHF concepts from Section 17.1: RLHF: Reinforcement Learning from Human Feedback, and on DPO and preference optimization, covered in Section 17.2: DPO and Modern Preference Optimization.

An angel on a model's shoulder representing constitutional principles guiding the model's self-critique
Figure 17.3.1: Constitutional AI gives the model its own moral compass. Instead of relying solely on human feedback, the model critiques itself against a set of principles.

1. The Human Annotation Bottleneck

Tip

When writing constitutional principles, be specific and actionable. "Be helpful" is too vague for the model to operationalize. "If the user asks a factual question, provide a direct answer with a source citation before any caveats" gives the model a concrete behavior to follow. Good constitutional principles read like a style guide for a human editor: precise enough that two different people would apply them the same way.

Standard RLHF requires large volumes of human preference data. OpenAI's InstructGPT used roughly 33,000 human comparisons. As models grow more capable, the annotation challenge intensifies (recalling the synthetic data principles from Section 13.1): annotators need domain expertise to evaluate complex outputs, agreement rates drop on subtle quality distinctions, and the cost per comparison rises. Furthermore, human preferences are inherently inconsistent; different annotators often disagree on which response is better, and individual annotators may be inconsistent across sessions.

Fun Fact

Constitutional AI asks the model to critique and revise its own outputs based on a set of written principles. It is self-improvement through self-reflection, which sounds like a meditation retreat but with gradient descent.

Mental Model: The Self-Governing Student Body

Think of Constitutional AI as a school that teaches students to govern themselves. Instead of hiring hall monitors (human annotators) for every hallway, you give students a constitution (a set of principles) and train them to evaluate their own behavior against it. The model generates a response, then critiques it against the principles, then revises it. This self-supervision cycle scales far better than human oversight, though the constitution itself must still be carefully designed by humans.

Constitutional AI addresses this bottleneck by replacing most human annotation with AI-generated feedback. The key insight is that a sufficiently capable model can evaluate its own outputs against explicit principles (a form of prompt-driven self-reflection), and these self-evaluations can serve as a training signal. The human role shifts from labeling individual examples to writing the principles (the "constitution") that guide evaluation.

Why this matters: Constitutional AI addresses the fundamental scalability bottleneck of RLHF: human annotators. As models become more capable, the volume of preference data needed for alignment grows, but recruiting, training, and calibrating human annotators does not scale linearly. Constitutional AI replaces human preference judgments with model self-critique guided by explicit principles (a "constitution"). This makes alignment more scalable, more consistent, and more transparent, because the principles are written down rather than implicit in annotator behavior. However, it shifts the challenge from annotator management to principle design: writing constitutional principles that cover all safety-relevant scenarios without being so restrictive that they cripple model usefulness. The alignment tax discussion below quantifies this tradeoff. These safety considerations connect to the broader safety and ethics framework in Chapter 26.

Common Mistake: Confusing Constitutional AI with Prompt-Based Safety Rules

Constitutional AI and system prompt safety instructions may look similar (both involve written rules), but they operate at fundamentally different levels. A system prompt safety rule is applied at inference time and can be bypassed through prompt injection or jailbreaking. Constitutional AI, by contrast, changes the model's weights during training: the model internalizes the principles through self-critique and RLHF, so the safety behaviors persist even without the system prompt. Teams sometimes implement a list of safety rules in their system prompt and believe they have "constitutional AI." They do not. If your safety behavior disappears when the system prompt is removed, you have prompt-level guardrails, not alignment. True constitutional alignment requires a training phase.
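
A quick diagnostic for which one you have: generate responses to the same red-team prompt set with and without the safety system prompt, then compare refusal rates. The sketch below is illustrative only — the keyword-based refusal detector is a naive stand-in for a proper judge model, and the helper names are hypothetical.

```python
# Distinguish prompt-level guardrails from trained-in alignment by
# comparing refusal rates on the same red-team prompts, with and
# without the safety system prompt. Keyword matching is a crude stub.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def is_refusal(response: str) -> bool:
    """Naive refusal detector; real harnesses use a judge model."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def diagnose_alignment(with_prompt, without_prompt):
    """Compare refusal rates over paired response lists.

    with_prompt / without_prompt: responses to the same red-team
    prompts, generated with and without the safety system prompt.
    """
    rate_with = sum(map(is_refusal, with_prompt)) / len(with_prompt)
    rate_without = sum(map(is_refusal, without_prompt)) / len(without_prompt)
    return {
        "refusal_with_prompt": rate_with,
        "refusal_without_prompt": rate_without,
        # If refusals collapse once the system prompt is removed, the
        # safety behavior lives in the prompt, not the weights.
        "prompt_level_only": rate_without < 0.5 * rate_with,
    }
```

If `prompt_level_only` comes back true, the safety behavior did not survive removal of the system prompt — exactly the failure mode described above.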

2. The Constitutional AI Framework

Constitutional AI operates in two phases. The first phase generates training data through self-critique and revision. The second phase trains a preference model on AI-generated comparisons, replacing the human labelers in standard RLHF. Figure 17.3.2 shows both phases of this pipeline.

A two-phase pipeline diagram. A constitution of written behavioral principles feeds both phases. Phase 1 (supervised self-critique): a harmful prompt yields an initial response, which is self-critiqued using the constitution and revised, repeated K times; the revised outputs become SFT training data. Phase 2 (RLAIF): an AI judge compares Response A and Response B against the constitution, producing an AI preference dataset.
Figure 17.3.2: The Constitutional AI pipeline. Phase 1 uses self-critique and revision to generate SFT data. Phase 2 uses AI-generated preference judgments (RLAIF) to train the reward model or run DPO.

2.1 Phase 1: Critique-Revision Pairs

In Phase 1, the model is presented with potentially harmful or low-quality prompts and generates an initial response. The model then critiques its own response against a specific constitutional principle and produces a revised version. This critique-revision loop can be repeated multiple times, producing progressively better responses.

Code Fragment 17.3.1 implements the full critique-revision loop, defining constitutional principles as structured data classes and applying them iteratively to generate SFT training data from self-critique.

# Constitutional AI: Phase 1 - Self-Critique and Revision
import random
from dataclasses import dataclass
from typing import List

@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

# Example constitution (simplified from Anthropic's approach)
CONSTITUTION = [
    ConstitutionalPrinciple(
        name="helpfulness",
        critique_prompt=(
            "Identify specific ways in which the assistant's response "
            "is unhelpful, incomplete, or fails to address the user's "
            "actual question."
        ),
        revision_prompt=(
            "Revise the response to be more helpful, complete, and "
            "directly address the user's question."
        ),
    ),
    ConstitutionalPrinciple(
        name="harmlessness",
        critique_prompt=(
            "Identify any content in the response that could be "
            "harmful, dangerous, unethical, or that provides "
            "instructions for illegal activities."
        ),
        revision_prompt=(
            "Revise the response to remove harmful content while "
            "still being as helpful as possible for legitimate uses."
        ),
    ),
    ConstitutionalPrinciple(
        name="honesty",
        critique_prompt=(
            "Identify any claims in the response that are likely "
            "false, misleading, or presented with unwarranted "
            "confidence. Note where uncertainty should be expressed."
        ),
        revision_prompt=(
            "Revise the response to be more truthful, express "
            "appropriate uncertainty, and avoid presenting "
            "speculation as fact."
        ),
    ),
]

def critique_and_revise(model, tokenizer, prompt, response, principle):
    """Apply one critique-revision step using a constitutional principle.

    Note: This is simplified pseudocode. A real implementation would
    use tokenizer(..., return_tensors="pt"), handle device placement,
    set generation parameters, and decode the output tokens.
    """
    # Step 1: Generate critique
    critique_input = (
        f"Here is a conversation:\n\n"
        f"Human: {prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique request: {principle.critique_prompt}\n"
        f"Critique:"
    )
    critique = model.generate(tokenizer.encode(critique_input))

    # Step 2: Generate revision
    revision_input = (
        f"Here is a conversation:\n\n"
        f"Human: {prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique: {critique}\n\n"
        f"Revision request: {principle.revision_prompt}\n"
        f"Revised response:"
    )
    revised = model.generate(tokenizer.encode(revision_input))

    return {"critique": critique, "revised_response": revised}

def build_cai_sft_dataset(model, tokenizer, prompts, constitution, rounds=3):
    """Build SFT data from iterative critique-revision."""
    sft_data = []

    for prompt in prompts:
        # Generate initial (potentially problematic) response
        response = model.generate(tokenizer.encode(prompt))

        # Apply multiple rounds of critique-revision
        for _ in range(rounds):
            principle = random.choice(constitution)
            result = critique_and_revise(
                model, tokenizer, prompt, response, principle
            )
            response = result["revised_response"]

        # Final revised response becomes the SFT target
        sft_data.append({"prompt": prompt, "response": response})

    return sft_data
Code Fragment 17.3.1: Constitutional AI: Phase 1 - Self-Critique and Revision
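The iterative pipeline in Code Fragment 17.3.2 also calls a Phase 2 helper, build_rlaif_dataset, to produce AI-judged preference pairs. A minimal sketch, following the same simplified pseudocode conventions as above (model.generate stands in for a real sample-and-decode step that returns text), might look like:

```python
# Phase 2 (RLAIF): the model judges which of two candidate responses
# better follows the constitution, yielding (prompt, chosen, rejected)
# preference pairs. Same pseudocode caveats as critique_and_revise.

def build_rlaif_dataset(model, tokenizer, prompts, constitution):
    pref_data = []
    for prompt in prompts:
        # Sample two candidate responses (a real system would sample
        # at nonzero temperature so the candidates differ)
        response_a = model.generate(tokenizer.encode(prompt))
        response_b = model.generate(tokenizer.encode(prompt))

        principles = "; ".join(p.critique_prompt for p in constitution)
        judge_input = (
            f"Conversation:\n\nHuman: {prompt}\n\n"
            f"Response A: {response_a}\n"
            f"Response B: {response_b}\n\n"
            f"Principles: {principles}\n"
            f"Which response better follows the principles? Answer A or B:"
        )
        verdict = str(model.generate(tokenizer.encode(judge_input)))

        chosen, rejected = (
            (response_a, response_b)
            if verdict.strip().upper().startswith("A")
            else (response_b, response_a)
        )
        pref_data.append(
            {"prompt": prompt, "chosen": chosen, "rejected": rejected}
        )
    return pref_data
```

Production judges typically randomize A/B order to counter position bias and query each principle separately rather than concatenating them.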

Code Fragment 17.3.2 wraps the entire Constitutional AI pipeline into an iterative self-improvement loop, running multiple rounds of critique-revision followed by SFT and DPO training, with early stopping when quality degrades.

# Iterative Self-Improvement Pipeline
# Assumes the Phase 2 helper build_rlaif_dataset plus train_sft and
# train_dpo training wrappers are defined elsewhere.
def iterative_self_improvement(
    base_model_path: str,
    constitution: List[ConstitutionalPrinciple],
    prompts: List[str],
    num_rounds: int = 3,
    eval_fn=None,
):
    """Run multiple rounds of self-improvement."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    current_model_path = base_model_path
    results_per_round = []

    for round_num in range(num_rounds):
        print(f"Round {round_num + 1}/{num_rounds}")

        model = AutoModelForCausalLM.from_pretrained(current_model_path)
        tokenizer = AutoTokenizer.from_pretrained(current_model_path)

        # Phase 1: Generate critique-revision SFT data
        sft_data = build_cai_sft_dataset(
            model, tokenizer, prompts, constitution, rounds=2
        )
        print(f"  Generated {len(sft_data)} SFT examples")

        # Phase 2: Generate RLAIF preferences
        pref_data = build_rlaif_dataset(
            model, tokenizer, prompts, constitution
        )
        print(f"  Generated {len(pref_data)} preference pairs")

        # Train: SFT on revised responses, then DPO on preferences
        new_model_path = f"./cai-round-{round_num + 1}"
        train_sft(model, sft_data, output_dir=f"{new_model_path}-sft")
        train_dpo(
            f"{new_model_path}-sft", pref_data,
            output_dir=new_model_path
        )

        # Evaluate
        if eval_fn:
            metrics = eval_fn(new_model_path)
            results_per_round.append(metrics)
            print(f"  Eval: {metrics}")

            # Early stopping if quality degrades
            if round_num > 0:
                prev = results_per_round[-2]
                curr = results_per_round[-1]
                if curr["quality"] < prev["quality"] * 0.95:
                    print("  Quality degradation detected, stopping.")
                    break

        current_model_path = new_model_path

    return current_model_path, results_per_round
Code Fragment 17.3.2: Iterative self-improvement pipeline combining CAI Phases 1 and 2. Each round generates critique-revision SFT data, builds RLAIF preference pairs, then trains sequentially with SFT and DPO. Early stopping halts the loop if capability metrics degrade beyond a 5% threshold.
Real-World Scenario: Using CAI to Align a Medical Q&A Assistant

Who: Dr. Meera Patel, head of AI at a telehealth startup.

Situation: Her team was building a patient-facing Q&A assistant that needed to be accurate, empathetic, and strictly compliant with medical safety guidelines, but the startup could not afford the $15,000 cost of physician annotations for a traditional RLHF preference dataset.

Problem: The base fine-tuned model sometimes guessed at diagnoses instead of expressing uncertainty, occasionally asked for personally identifiable health information, and did not consistently redirect emergencies to 911. These failures were unacceptable in a healthcare context.

Decision: Dr. Patel wrote a five-principle constitution covering accuracy ("cite sources for medical claims"), scope ("redirect emergencies to 911"), empathy ("acknowledge patient concerns before answering"), privacy ("never ask for identifiable information"), and uncertainty ("say 'I don't know' rather than guess"). The team ran three rounds of critique-revision on 2,000 patient questions, generating 6,000 revised SFT examples, then used RLAIF to produce 10,000 preference pairs at $0.003 each (total: $30).

Result: The final model scored 94% on a safety evaluation suite with only a 1.2% regression on medical knowledge benchmarks. Total alignment cost was under $500, compared to the estimated $15,000 for equivalent physician annotations.

Lesson: Constitutional AI enables domain experts to encode safety requirements as principles rather than curating thousands of labeled examples. This is especially powerful in specialized domains where expert annotators are expensive and the safety requirements can be articulated as clear rules.

5. The Alignment Tax

A persistent concern in alignment research is the "alignment tax": the cost in general capabilities that alignment training imposes. Models trained with RLHF or CAI sometimes perform worse on benchmarks that measure raw knowledge, reasoning, or coding ability compared to their unaligned base models. This creates a tension between safety and capability.

Warning

The alignment tax is real but often overstated. Careful alignment training with appropriate KL penalties preserves most general capabilities. The bigger risk is over-alignment, where the model becomes excessively cautious, refusing legitimate requests or hedging every statement with unnecessary disclaimers. Finding the right balance requires continuous evaluation across both safety and capability benchmarks. Code Fragment 17.3.3 quantifies this tradeoff by comparing base and aligned model scores across multiple benchmarks.

5.1 Measuring the Alignment Tax

Code Fragment 17.3.3 measures the alignment tax by comparing a model's benchmark scores before and after alignment training.

# Measuring alignment tax across capability dimensions
from dataclasses import dataclass
from typing import Dict

@dataclass
class AlignmentTaxReport:
    model_name: str
    base_scores: Dict[str, float]
    aligned_scores: Dict[str, float]

    def compute_tax(self) -> Dict[str, float]:
        """Compute per-benchmark alignment tax (positive = regression)."""
        tax = {}
        for benchmark in self.base_scores:
            base = self.base_scores[benchmark]
            aligned = self.aligned_scores.get(benchmark, 0)
            tax[benchmark] = (base - aligned) / base * 100
        return tax

    def report(self):
        tax = self.compute_tax()
        print(f"Alignment Tax Report: {self.model_name}")
        print("-" * 55)
        for bench, pct in tax.items():
            direction = "regression" if pct > 0 else "improvement"
            print(f" {bench:25s}: {abs(pct):5.1f}% {direction}")
        avg_tax = sum(tax.values()) / len(tax)
        print(f" {'Average tax':25s}: {avg_tax:5.1f}%")

# Example: comparing base vs. aligned model
report = AlignmentTaxReport(
    model_name="Llama-3.1-8B-Instruct vs Base",
    base_scores={
        "MMLU": 65.2,
        "HumanEval": 42.1,
        "GSM8K": 56.8,
        "TruthfulQA": 38.5,
        "HellaSwag": 78.3,
    },
    aligned_scores={
        "MMLU": 63.8,        # small regression
        "HumanEval": 40.5,   # small regression
        "GSM8K": 58.2,       # improvement (instruction following helps)
        "TruthfulQA": 52.1,  # large improvement (alignment goal)
        "HellaSwag": 76.9,   # small regression
    },
)
report.report()
# Prints per-benchmark tax (e.g. MMLU: 2.1% regression, TruthfulQA:
# 35.3% improvement) and an average tax of -6.0% for these scores.
Code Fragment 17.3.3: Measuring alignment tax across capability dimensions

6. Shallow Safety Alignment

Research has revealed a concerning phenomenon: safety alignment in current models may be more superficial than it appears. Studies have shown that safety training can be undone with minimal fine-tuning (sometimes as few as 10 to 100 examples of harmful content), suggesting that alignment modifies surface-level behavior rather than deeply changing the model's representations.

Note

The fragility of safety alignment has significant implications for open-weight models. If alignment can be reversed with trivial fine-tuning, then releasing aligned open-weight models provides only a modest speed bump against misuse. This observation motivates research into more robust alignment methods that modify deeper representations, as well as complementary approaches like inference-time guardrails and output filtering.

Figure 17.3.3 contrasts shallow alignment (a thin, removable layer) with the goal of deep alignment (safety integrated into core representations).

Side-by-side comparison. Shallow alignment: a thin, removable safety layer sits atop core knowledge and capabilities left unchanged by alignment. Deep alignment: safety is integrated throughout the knowledge and capabilities, making it robust to fine-tuning attacks.
Figure 17.3.3: Shallow alignment adds a thin safety layer that can be removed by fine-tuning. Deep alignment (the research goal) integrates safety into the model's core representations.
Key Insight

Current alignment techniques (RLHF, DPO, CAI) primarily teach the model when to refuse rather than removing the underlying capability to generate harmful content. This is analogous to teaching someone not to pick locks rather than making them forget how locks work (see the prompt injection patterns in Section 11.3 for related attack surfaces). True robust alignment likely requires deeper modifications to model representations, an active area of research in mechanistic interpretability (Chapter 18). Section 29.1 can help measure how deep alignment actually goes.

Self-Check
Q1: What are the two phases of Constitutional AI, and what does each produce?
Show Answer
Phase 1 (Supervised Self-Critique) generates SFT training data by having the model critique and revise its own responses against constitutional principles. Phase 2 (RLAIF) generates preference data by having the model judge which of two responses better adheres to the constitution. Phase 1 produces (prompt, revised_response) pairs for SFT. Phase 2 produces (prompt, chosen, rejected) triples for reward model training or DPO.
Q2: How does the human role differ between RLHF and Constitutional AI?
Show Answer
In RLHF, humans label individual preference pairs (comparing specific responses). In Constitutional AI, humans write the constitution (a set of high-level principles). The human effort shifts from labeling thousands of examples to crafting a small set of well-defined behavioral rules. This makes the alignment specification explicit, auditable, and modifiable.
Q3: What is the alignment tax, and how can it be measured?
Show Answer
The alignment tax is the reduction in general capabilities (knowledge, reasoning, coding) that results from alignment training. It is measured by comparing the aligned model's performance on standard benchmarks (MMLU, HumanEval, GSM8K) against the unaligned base model. A well-tuned alignment process minimizes this tax while maximizing safety improvements on benchmarks like TruthfulQA.
Q4: Why is shallow safety alignment a concern for open-weight models?
Show Answer
Research shows that safety alignment can be reversed with minimal fine-tuning (sometimes 10 to 100 harmful examples). For open-weight models where anyone can fine-tune, this means safety training provides only a modest barrier against misuse. The underlying harmful capabilities remain in the model's weights and can be re-exposed with trivial effort.
Q5: What advantage does RLAIF have over human-labeled RLHF in terms of cost and throughput?
Show Answer
RLAIF costs roughly $0.001 to $0.01 per comparison (API token cost) versus $0.50 to $5.00 for human annotation. Throughput increases from hundreds per annotator per day to tens of thousands per hour. RLAIF is also more consistent (no inter-annotator disagreement) and more easily adaptable (update the constitution). The tradeoff is that AI judges may have systematic biases different from human biases.

Research Frontier

Constitutional AI is expanding toward democratic constitution design, where diverse groups of stakeholders collaboratively define the principles that guide AI behavior rather than relying solely on researcher-authored rules. Research on self-play for alignment extends the constitutional approach by having models debate each other and iteratively refine their outputs against constitutional principles.

The frontier challenge is creating constitutions that handle genuine value conflicts (such as helpfulness versus privacy) with nuance rather than defaulting to overly cautious refusals.

Exercises

Exercise 17.3.1: Constitutional AI principles Conceptual

Explain the core idea of Constitutional AI. How does writing a set of behavioral principles replace the need for thousands of human preference labels?

Answer Sketch

Constitutional AI (CAI) defines a set of written principles (a 'constitution') that specify desired model behavior (e.g., 'Be helpful, honest, and harmless,' 'Do not generate violent content'). Instead of human annotators comparing responses, the model itself critiques and revises its outputs according to these principles. The self-critique outputs become training data. This replaces expensive human labeling with automated, principle-guided data generation, and makes alignment behavior explicitly specifiable rather than implicitly learned from examples.

Exercise 17.3.2: RLAIF pipeline Conceptual

Describe the RLAIF (RL from AI Feedback) pipeline used in Constitutional AI. How does the model generate its own preference data?

Answer Sketch

Step 1: Generate multiple responses to each prompt. Step 2: For each response, ask the model to critique it against each constitutional principle. Step 3: Ask the model to revise the response based on the critique. Step 4: Use the original and revised responses as preference pairs (revised = chosen, original = rejected). Step 5: Train a reward model on these AI-generated preferences. Step 6: Run PPO using the AI preference reward model. The entire pipeline requires no human preference annotations.
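
Step 4 of this pipeline reduces to a small data transformation: each Phase 1 record becomes a preference pair with the revised response as "chosen" and the original as "rejected". A sketch (the field names are hypothetical, chosen to match the critique-revision outputs earlier in this section):

```python
def revisions_to_preferences(examples):
    """Convert Phase 1 critique-revision records into preference pairs:
    revised response = chosen, original response = rejected."""
    return [
        {
            "prompt": ex["prompt"],
            "chosen": ex["revised_response"],
            "rejected": ex["original_response"],
        }
        for ex in examples
        # Identical pairs carry no preference signal; drop them
        if ex["revised_response"] != ex["original_response"]
    ]
```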

Exercise 17.3.3: Constitution design Coding

Write a constitution (set of 5 to 7 principles) for a customer service chatbot. Each principle should be specific enough to guide self-critique but general enough to apply across many scenarios.

Answer Sketch

1. 'Always provide accurate information about our products and policies. If unsure, say so.' 2. 'Be empathetic to customer frustration without being condescending.' 3. 'Never share other customers' personal information.' 4. 'If you cannot resolve the issue, offer to escalate to a human agent.' 5. 'Keep responses concise (under 150 words) unless the customer asks for detail.' 6. 'Never make promises about refunds or compensation without checking policy.' 7. 'Use professional but friendly language; avoid jargon.'
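
Expressed as structured data, in the style of the ConstitutionalPrinciple dataclass from Code Fragment 17.3.1, several of these principles might look like the sketch below (the names and wording are one possible answer, not canonical):

```python
# A customer-service constitution as structured data, so it can drive
# a critique-revision loop directly. Illustrative wording only.
from dataclasses import dataclass

@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

CUSTOMER_SERVICE_CONSTITUTION = [
    ConstitutionalPrinciple(
        name="accuracy",
        critique_prompt=(
            "Identify claims about products or policies that are "
            "uncertain or unverified."
        ),
        revision_prompt=(
            "Revise to state only verified facts; say 'I'm not sure, "
            "let me check' where uncertain."
        ),
    ),
    ConstitutionalPrinciple(
        name="privacy",
        critique_prompt=(
            "Identify any disclosure of another customer's personal "
            "information."
        ),
        revision_prompt="Remove all references to other customers' data.",
    ),
    ConstitutionalPrinciple(
        name="escalation",
        critique_prompt=(
            "Check whether an unresolved issue was left without an "
            "offer to escalate to a human agent."
        ),
        revision_prompt=(
            "Add an offer to connect the customer with a human agent "
            "if the issue remains unresolved."
        ),
    ),
    ConstitutionalPrinciple(
        name="concision",
        critique_prompt=(
            "Flag responses over roughly 150 words of unrequested detail."
        ),
        revision_prompt=(
            "Trim to the essentials unless the customer asked for detail."
        ),
    ),
    ConstitutionalPrinciple(
        name="no-unauthorized-promises",
        critique_prompt=(
            "Identify promises about refunds or compensation not "
            "grounded in stated policy."
        ),
        revision_prompt=(
            "Replace such promises with a commitment to check the "
            "relevant policy first."
        ),
    ),
]
```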

Exercise 17.3.4: Self-critique effectiveness Analysis

Under what conditions does self-critique (asking the model to evaluate its own output) work well, and when does it fail? How does model size affect self-critique quality?

Answer Sketch

Works well when: (1) the model is large enough to reason about its own outputs (>70B parameters typically), (2) the critique criteria are clear and objective (e.g., 'contains factual errors'), (3) the error is something the model 'knows' is wrong but failed to avoid in the initial generation. Fails when: (1) the model lacks the knowledge to detect its own errors (blind spots), (2) critique criteria are subjective, (3) the model is too small to reliably evaluate complex outputs. Smaller models need external critics.

Exercise 17.3.5: CAI vs. RLHF tradeoffs Discussion

Compare Constitutional AI and traditional RLHF in terms of cost, scalability, transparency, and alignment quality. When would you prefer each approach?

Answer Sketch

CAI advantages: much cheaper (no human annotators), more scalable, transparent (principles are human-readable), easy to update (modify principles, regenerate data). RLHF advantages: captures subtle human preferences that are hard to articulate as rules, may produce more natural-feeling responses. Prefer CAI when: budget is limited, alignment requirements are clearly articulable, you need auditability. Prefer RLHF when: you have budget for annotators, the desired behavior is subjective or hard to formalize, and you need the highest quality alignment.

Tip: Use a KL Penalty to Prevent Reward Hacking

During RLHF, always include a KL divergence penalty between the policy and the reference model. Without it, the model quickly learns to exploit reward model weaknesses, producing high-reward but low-quality outputs.
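
Concretely, the penalty is usually applied per token: each generated token receives −β times the log-probability ratio between the policy and the reference model, and the sequence-level reward model score is added at the final token. A minimal sketch of this common shaping (β = 0.1 is an illustrative default, not a recommendation):

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token RLHF rewards: -beta * (log pi - log pi_ref) for each
    generated token, plus the reward-model score on the final token."""
    rewards = [
        -beta * (lp - ref_lp)
        for lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score  # sequence-level RM score lands at the end
    return rewards
```

When the policy stays close to the reference (log-ratios near zero), the penalty vanishes; large positive log-ratios, the signature of reward hacking, are taxed in proportion to the drift.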

What Comes Next

The next section, Section 17.4: RLVR: Reinforcement Learning with Verifiable Rewards, covers RLVR, which replaces learned reward models with automated verification for domains like math and code.

References & Further Reading
Constitutional AI & AI Feedback

Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback.

The foundational CAI paper from Anthropic. Describes both the critique-and-revision pipeline and RLAIF training with AI-generated preferences. Defines the constitutional approach that replaces per-example human labels with declarative principles.

📄 Paper

Lee, H., Phatale, S., Mansoor, H., et al. (2023). RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.

Google's systematic study comparing AI-generated preferences to human preferences for RL training. Shows RLAIF can match RLHF quality at significantly lower cost, validating the AI feedback approach at scale.

📄 Paper
Self-Alignment Methods

Sun, Z., Shen, Y., Zhou, Q., et al. (2023). Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. NeurIPS 2023.

Extends the constitutional approach by bootstrapping alignment from a small seed set of principles with minimal human involvement. Demonstrates that models can self-align effectively when given clear behavioral specifications.

📄 Paper

Yuan, W., Pang, R. Y., Cho, K., et al. (2024). Self-Rewarding Language Models. ICML 2024.

Proposes models that iteratively improve by judging their own outputs and training on the resulting preferences. Creates a self-improvement loop where both generation and evaluation quality increase together.

📄 Paper
Related Alignment Research

Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2023). Discovering Latent Knowledge in Language Models Without Supervision. ICLR 2023.

Introduces Contrast-Consistent Search (CCS) for extracting truthful beliefs from LLM representations without labels. Relevant to understanding how models internally represent alignment-relevant concepts.

📄 Paper

Ganguli, D., Lovitt, L., Kernion, J., et al. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.

Comprehensive study of red teaming methods for finding alignment failures. Covers both manual and automated red teaming approaches, with insights on how harmful behavior scales with model size.

📄 Paper