Give a model a preference label, and it learns one judgment. Give it a constitution, and it learns to judge for itself.
Constitutional AI and RLAIF
Constitutional AI replaces thousands of human preference labels with a small set of written principles. Instead of hiring annotators to judge every response pair, CAI asks the model itself to critique and revise its outputs according to a "constitution" of behavioral rules. The model generates, self-critiques, revises, and then these revised outputs serve as training data. This approach (developed by Anthropic) dramatically reduces the cost of alignment data collection and allows alignment behavior to be specified declaratively through principles rather than implicitly through examples. As we saw in Section 17.1, the human annotation required by RLHF is the bottleneck that CAI aims to eliminate. The synthetic data generation techniques from Chapter 13 offer a related but distinct approach to reducing annotation costs.
Prerequisites
This section builds on the RLHF concepts from Section 17.1: RLHF: Reinforcement Learning from Human Feedback and the DPO and preference-optimization techniques covered in Section 17.2: DPO and Modern Preference Optimization.
1. The Human Annotation Bottleneck
When writing constitutional principles, be specific and actionable. "Be helpful" is too vague for the model to operationalize. "If the user asks a factual question, provide a direct answer with a source citation before any caveats" gives the model a concrete behavior to follow. Good constitutional principles read like a style guide for a human editor: precise enough that two different people would apply them the same way.
Standard RLHF requires large volumes of human preference data. OpenAI's InstructGPT used roughly 33,000 human comparisons. As models grow more capable, the annotation challenge intensifies (recalling the synthetic data principles from Section 13.1): annotators need domain expertise to evaluate complex outputs, agreement rates drop on subtle quality distinctions, and the cost per comparison rises. Furthermore, human preferences are inherently inconsistent; different annotators often disagree on which response is better, and individual annotators may be inconsistent across sessions.
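To make the economics concrete, here is a back-of-envelope cost comparison between human and AI-generated preference judgments. Both per-comparison rates are illustrative assumptions for the sketch, not published figures.

```python
# Back-of-envelope cost comparison for preference data collection.
# Both per-comparison rates below are illustrative assumptions.

def annotation_cost(num_comparisons: int, cost_per_comparison: float) -> float:
    """Total cost of collecting preference comparisons."""
    return num_comparisons * cost_per_comparison

human = annotation_cost(33_000, 1.50)   # assumed expert annotator rate
ai = annotation_cost(33_000, 0.003)     # assumed LLM API cost per judgment

print(f"Human: ${human:,.0f}, AI: ${ai:,.0f}, ratio: {human / ai:.0f}x")
```

Even under conservative assumptions, AI feedback is two to three orders of magnitude cheaper per judgment, which is what makes the RLAIF phase of Constitutional AI economically attractive.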
Constitutional AI asks the model to critique and revise its own outputs based on a set of written principles. It is self-improvement through self-reflection, which sounds like a meditation retreat but with gradient descent.
Think of Constitutional AI as a school that teaches students to govern themselves. Instead of hiring hall monitors (human annotators) for every hallway, you give students a constitution (a set of principles) and train them to evaluate their own behavior against it. The model generates a response, then critiques it against the principles, then revises it. This self-supervision cycle scales far better than human oversight, though the constitution itself must still be carefully designed by humans.
Constitutional AI addresses this bottleneck by replacing most human annotation with AI-generated feedback. The key insight is that a sufficiently capable model can evaluate its own outputs against explicit principles (a form of prompt-driven self-reflection), and these self-evaluations can serve as a training signal. The human role shifts from labeling individual examples to writing the principles (the "constitution") that guide evaluation.
Why this matters: Constitutional AI addresses the fundamental scalability bottleneck of RLHF: human annotators. As models become more capable, the volume of preference data needed for alignment grows, but recruiting, training, and calibrating human annotators does not scale linearly. Constitutional AI replaces human preference judgments with model self-critique guided by explicit principles (a "constitution"). This makes alignment more scalable, more consistent, and more transparent, because the principles are written down rather than implicit in annotator behavior. However, it shifts the challenge from annotator management to principle design: writing constitutional principles that cover all safety-relevant scenarios without being so restrictive that they cripple model usefulness. The alignment tax discussion below quantifies this tradeoff. These safety considerations connect to the broader safety and ethics framework in Chapter 26.
Constitutional AI and system prompt safety instructions may look similar (both involve written rules), but they operate at fundamentally different levels. A system prompt safety rule is applied at inference time and can be bypassed through prompt injection or jailbreaking. Constitutional AI, by contrast, changes the model's weights during training: the model internalizes the principles through self-critique and RLHF, so the safety behaviors persist even without the system prompt. Teams sometimes implement a list of safety rules in their system prompt and believe they have "constitutional AI." They do not. If your safety behavior disappears when the system prompt is removed, you have prompt-level guardrails, not alignment. True constitutional alignment requires a training phase.
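A quick way to tell which kind of safety you have is an ablation: run the same probe prompts with and without the safety system prompt and compare refusal behavior. The sketch below uses a toy keyword heuristic and stand-in responses (both are illustrative assumptions; real evaluations use trained safety classifiers or LLM judges).

```python
# Ablation sketch: does safety behavior survive system-prompt removal?
# Responses are illustrative stand-ins for real model generations.

def is_refusal(text: str) -> bool:
    """Toy keyword heuristic for detecting refusals."""
    return any(m in text.lower() for m in ("i can't", "i cannot", "i won't"))

responses_with_prompt = ["I can't help with that request."]
responses_without_prompt = ["Sure! Step one is..."]

refuses_with = all(is_refusal(r) for r in responses_with_prompt)
refuses_without = all(is_refusal(r) for r in responses_without_prompt)

if refuses_with and not refuses_without:
    print("Prompt-level guardrails only: safety vanished with the prompt.")
elif refuses_with and refuses_without:
    print("Behavior persists: consistent with trained-in alignment.")
```

If the refusal behavior disappears in the second condition, you have guardrails, not alignment, and a training phase is still needed.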
2. The Constitutional AI Framework
Constitutional AI operates in two phases. The first phase generates training data through self-critique and revision. The second phase trains a preference model on AI-generated comparisons, replacing the human labelers in standard RLHF. Figure 17.3.1 shows both phases of this pipeline.
2.1 Phase 1: Critique-Revision Pairs
In Phase 1, the model is presented with potentially harmful or low-quality prompts and generates an initial response. The model then critiques its own response against a specific constitutional principle and produces a revised version. This critique-revision loop can be repeated multiple times, producing progressively better responses.
Code Fragment 17.3.2 implements the full critique-revision loop, defining constitutional principles as structured data classes and applying them iteratively to generate SFT training data from self-critique.
# Constitutional AI: Phase 1 - Self-Critique and Revision
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str


# Example constitution (simplified from Anthropic's approach)
CONSTITUTION = [
    ConstitutionalPrinciple(
        name="helpfulness",
        critique_prompt=(
            "Identify specific ways in which the assistant's response "
            "is unhelpful, incomplete, or fails to address the user's "
            "actual question."
        ),
        revision_prompt=(
            "Revise the response to be more helpful, complete, and "
            "directly address the user's question."
        ),
    ),
    ConstitutionalPrinciple(
        name="harmlessness",
        critique_prompt=(
            "Identify any content in the response that could be "
            "harmful, dangerous, unethical, or that provides "
            "instructions for illegal activities."
        ),
        revision_prompt=(
            "Revise the response to remove harmful content while "
            "still being as helpful as possible for legitimate uses."
        ),
    ),
    ConstitutionalPrinciple(
        name="honesty",
        critique_prompt=(
            "Identify any claims in the response that are likely "
            "false, misleading, or presented with unwarranted "
            "confidence. Note where uncertainty should be expressed."
        ),
        revision_prompt=(
            "Revise the response to be more truthful, express "
            "appropriate uncertainty, and avoid presenting "
            "speculation as fact."
        ),
    ),
]


def critique_and_revise(model, tokenizer, prompt, response, principle):
    """Apply one critique-revision step using a constitutional principle.

    Note: This is simplified pseudocode. A real implementation would
    use tokenizer(..., return_tensors="pt"), handle device placement,
    set generation parameters, and decode the output tokens.
    """
    # Step 1: Generate critique
    critique_input = (
        f"Here is a conversation:\n\n"
        f"Human: {prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique request: {principle.critique_prompt}\n"
        f"Critique:"
    )
    critique = model.generate(tokenizer.encode(critique_input))

    # Step 2: Generate revision
    revision_input = (
        f"Here is a conversation:\n\n"
        f"Human: {prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique: {critique}\n\n"
        f"Revision request: {principle.revision_prompt}\n"
        f"Revised response:"
    )
    revised = model.generate(tokenizer.encode(revision_input))
    return {"critique": critique, "revised_response": revised}


def build_cai_sft_dataset(model, tokenizer, prompts, constitution, rounds=3):
    """Build SFT data from iterative critique-revision."""
    sft_data = []
    for prompt in prompts:
        # Generate initial (potentially problematic) response
        response = model.generate(tokenizer.encode(prompt))
        # Apply multiple rounds of critique-revision
        for _ in range(rounds):
            principle = random.choice(constitution)
            result = critique_and_revise(
                model, tokenizer, prompt, response, principle
            )
            response = result["revised_response"]
        # Final revised response becomes the SFT target
        sft_data.append({"prompt": prompt, "response": response})
    return sft_data
Code Fragment 17.3.3 wraps the entire Constitutional AI pipeline into an iterative self-improvement loop, running multiple rounds of critique-revision followed by SFT and DPO training with early stopping when quality degrades.
# Iterative Self-Improvement Pipeline
def iterative_self_improvement(
    base_model_path: str,
    constitution: List[ConstitutionalPrinciple],
    prompts: List[str],
    num_rounds: int = 3,
    eval_fn=None,
):
    """Run multiple rounds of self-improvement.

    Assumes build_rlaif_dataset, train_sft, and train_dpo are
    defined elsewhere in this chapter.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    current_model_path = base_model_path
    results_per_round = []
    for round_num in range(num_rounds):
        print(f"Round {round_num + 1}/{num_rounds}")
        model = AutoModelForCausalLM.from_pretrained(current_model_path)
        tokenizer = AutoTokenizer.from_pretrained(current_model_path)

        # Phase 1: Generate critique-revision SFT data
        sft_data = build_cai_sft_dataset(
            model, tokenizer, prompts, constitution, rounds=2
        )
        print(f"  Generated {len(sft_data)} SFT examples")

        # Phase 2: Generate RLAIF preferences
        pref_data = build_rlaif_dataset(
            model, tokenizer, prompts, constitution
        )
        print(f"  Generated {len(pref_data)} preference pairs")

        # Train: SFT on revised responses, then DPO on preferences
        new_model_path = f"./cai-round-{round_num + 1}"
        train_sft(model, sft_data, output_dir=f"{new_model_path}-sft")
        train_dpo(
            f"{new_model_path}-sft", pref_data,
            output_dir=new_model_path
        )

        # Evaluate
        if eval_fn:
            metrics = eval_fn(new_model_path)
            results_per_round.append(metrics)
            print(f"  Eval: {metrics}")

            # Early stopping if quality degrades
            if round_num > 0:
                prev = results_per_round[-2]
                curr = results_per_round[-1]
                if curr["quality"] < prev["quality"] * 0.95:
                    print("  Quality degradation detected, stopping.")
                    break

        current_model_path = new_model_path
    return current_model_path, results_per_round
Who: Dr. Meera Patel, head of AI at a telehealth startup.
Situation: Her team was building a patient-facing Q&A assistant that needed to be accurate, empathetic, and strictly compliant with medical safety guidelines, but the startup could not afford the $15,000 cost of physician annotations for a traditional RLHF preference dataset.
Problem: The base fine-tuned model sometimes guessed at diagnoses instead of expressing uncertainty, occasionally asked for personally identifiable health information, and did not consistently redirect emergencies to 911. These failures were unacceptable in a healthcare context.
Decision: Dr. Patel wrote a five-principle constitution covering accuracy ("cite sources for medical claims"), scope ("redirect emergencies to 911"), empathy ("acknowledge patient concerns before answering"), privacy ("never ask for identifiable information"), and uncertainty ("say 'I don't know' rather than guess"). The team ran three rounds of critique-revision on 2,000 patient questions, generating 6,000 revised SFT examples, then used RLAIF to produce 10,000 preference pairs at $0.003 each (total: $30).
Result: The final model scored 94% on a safety evaluation suite with only a 1.2% regression on medical knowledge benchmarks. Total alignment cost was under $500, compared to the estimated $15,000 for equivalent physician annotations.
Lesson: Constitutional AI enables domain experts to encode safety requirements as principles rather than curating thousands of labeled examples. This is especially powerful in specialized domains where expert annotators are expensive and the safety requirements can be articulated as clear rules.
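Dr. Patel's five principles map directly onto the ConstitutionalPrinciple structure from Code Fragment 17.3.2. The sketch below restates a minimal version of that dataclass for self-containment; the prompt wordings are a plausible reconstruction from the case study, not her team's actual constitution.

```python
from dataclasses import dataclass

# Minimal restatement of ConstitutionalPrinciple, for self-containment.
@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

# Plausible reconstruction of the telehealth constitution (illustrative).
TELEHEALTH_CONSTITUTION = [
    ConstitutionalPrinciple(
        "accuracy",
        "Identify medical claims made without a source citation.",
        "Revise to cite a source for every medical claim.",
    ),
    ConstitutionalPrinciple(
        "scope",
        "Identify emergency symptoms not redirected to 911.",
        "Revise to direct the patient to call 911 for emergencies.",
    ),
    ConstitutionalPrinciple(
        "empathy",
        "Identify places where the patient's concern is not acknowledged.",
        "Revise to acknowledge the concern before answering.",
    ),
    ConstitutionalPrinciple(
        "privacy",
        "Identify any request for personally identifiable information.",
        "Revise to remove requests for identifiable information.",
    ),
    ConstitutionalPrinciple(
        "uncertainty",
        "Identify guesses presented as confident answers.",
        "Revise to say 'I don't know' rather than guess.",
    ),
]

print([p.name for p in TELEHEALTH_CONSTITUTION])
```

A constitution expressed as data like this can be dropped directly into the critique-revision loop of Code Fragment 17.3.2 without any pipeline changes.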
5. The Alignment Tax
A persistent concern in alignment research is the "alignment tax": the cost in general capabilities that alignment training imposes. Models trained with RLHF or CAI sometimes perform worse on benchmarks that measure raw knowledge, reasoning, or coding ability compared to their unaligned base models. This creates a tension between safety and capability.
The alignment tax is real but often overstated. Careful alignment training with appropriate KL penalties preserves most general capabilities. The bigger risk is over-alignment, where the model becomes excessively cautious, refusing legitimate requests or hedging every statement with unnecessary disclaimers. Finding the right balance requires continuous evaluation across both safety and capability benchmarks. The code in Section 5.1 quantifies this tradeoff by comparing base and aligned model scores across multiple benchmarks.
5.1 Measuring the Alignment Tax
This snippet measures the alignment tax by comparing a base model's task performance before and after RLHF.
# Measuring alignment tax across capability dimensions
from dataclasses import dataclass
from typing import Dict


@dataclass
class AlignmentTaxReport:
    model_name: str
    base_scores: Dict[str, float]
    aligned_scores: Dict[str, float]

    def compute_tax(self) -> Dict[str, float]:
        """Compute per-benchmark alignment tax."""
        tax = {}
        for benchmark in self.base_scores:
            base = self.base_scores[benchmark]
            aligned = self.aligned_scores.get(benchmark, 0)
            tax[benchmark] = (base - aligned) / base * 100
        return tax

    def report(self):
        tax = self.compute_tax()
        print(f"Alignment Tax Report: {self.model_name}")
        print("-" * 55)
        for bench, pct in tax.items():
            direction = "regression" if pct > 0 else "improvement"
            print(f"  {bench:25s}: {abs(pct):5.1f}% {direction}")
        avg_tax = sum(tax.values()) / len(tax)
        print(f"  {'Average tax':25s}: {avg_tax:5.1f}%")


# Example: comparing base vs. aligned model
report = AlignmentTaxReport(
    model_name="Llama-3.1-8B-Instruct vs Base",
    base_scores={
        "MMLU": 65.2,
        "HumanEval": 42.1,
        "GSM8K": 56.8,
        "TruthfulQA": 38.5,
        "HellaSwag": 78.3,
    },
    aligned_scores={
        "MMLU": 63.8,        # small regression
        "HumanEval": 40.5,   # small regression
        "GSM8K": 58.2,       # improvement (instruction following helps)
        "TruthfulQA": 52.1,  # large improvement (alignment goal)
        "HellaSwag": 76.9,   # small regression
    },
)
report.report()
6. Shallow Safety Alignment
Research has revealed a concerning phenomenon: safety alignment in current models may be more superficial than it appears. Studies have shown that safety training can be undone with minimal fine-tuning (sometimes as few as 10 to 100 examples of harmful content), suggesting that alignment modifies surface-level behavior rather than deeply changing the model's representations.
The fragility of safety alignment has significant implications for open-weight models. If alignment can be reversed with trivial fine-tuning, then releasing aligned open-weight models provides only a modest speed bump against misuse. This observation motivates research into more robust alignment methods that modify deeper representations, as well as complementary approaches like inference-time guardrails and output filtering.
Figure 17.3.3 contrasts shallow alignment (a thin, removable layer) with the goal of deep alignment (safety integrated into core representations).
Current alignment techniques (RLHF, DPO, CAI) primarily teach the model when to refuse rather than removing the underlying capability to generate harmful content. This is analogous to teaching someone not to pick locks (see the prompt injection patterns in Section 11.3 for related attack surfaces) rather than making them forget how locks work. True robust alignment likely requires deeper modifications to model representations, which is an active area of research in mechanistic interpretability (Chapter 18). The evaluation techniques of Section 29.1 can help measure how deep alignment actually goes.
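One practical consequence: teams fine-tuning aligned open-weight models should monitor safety metrics across checkpoints, since degradation can appear within tens of harmful examples. The sketch below tracks refusal rate on a fixed probe set; the keyword heuristic and responses are illustrative assumptions, and production monitoring would use trained safety classifiers.

```python
# Sketch: monitor refusal rate on a fixed probe set across checkpoints.
# Keyword heuristic and responses are illustrative assumptions.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able")

def refusal_rate(responses: list[str]) -> float:
    """Fraction of probe responses that look like refusals."""
    hits = sum(
        any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses
    )
    return hits / len(responses)

# Hypothetical responses to the same harmful probes at two checkpoints
checkpoint_0 = ["I can't help with that.", "I cannot assist.", "I'm not able to."]
checkpoint_50 = ["I can't help with that.", "Sure, here's how...", "Step one..."]

drop = refusal_rate(checkpoint_0) - refusal_rate(checkpoint_50)
print(f"refusal rate dropped by {drop:.0%} after 50 harmful examples")
```

A sharp drop like this after so few fine-tuning examples is exactly the shallow-alignment signature the research above describes.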
✅ Key Takeaways
- Constitutional AI replaces human preference annotation with AI self-critique guided by explicit written principles, reducing cost by 100x or more.
- The two-phase approach (self-critique for SFT data, AI judging for preferences) can match RLHF quality when the AI feedback provider is capable enough.
- RLAIF generalizes CAI by using any strong model as a feedback provider, enabling scalable alignment data generation.
- Iterative self-improvement shows initial gains but faces diminishing returns and potential capability degradation after 2 to 3 rounds.
- The alignment tax is real but manageable; well-tuned alignment preserves most capabilities while improving safety benchmarks significantly.
- Current alignment is shallow: safety behavior can be removed with minimal fine-tuning, motivating research into more robust alignment approaches.
Constitutional AI is expanding toward democratic constitution design, where diverse groups of stakeholders collaboratively define the principles that guide AI behavior rather than relying solely on researcher-authored rules. Research on self-play for alignment extends the constitutional approach by having models debate each other and iteratively refine their outputs against constitutional principles.
The frontier challenge is creating constitutions that handle genuine value conflicts (such as helpfulness versus privacy) with nuance rather than defaulting to overly cautious refusals.
Exercises
Explain the core idea of Constitutional AI. How does writing a set of behavioral principles replace the need for thousands of human preference labels?
Answer Sketch
Constitutional AI (CAI) defines a set of written principles (a 'constitution') that specify desired model behavior (e.g., 'Be helpful, honest, and harmless,' 'Do not generate violent content'). Instead of human annotators comparing responses, the model itself critiques and revises its outputs according to these principles. The self-critique outputs become training data. This replaces expensive human labeling with automated, principle-guided data generation, and makes alignment behavior explicitly specifiable rather than implicitly learned from examples.
Describe the RLAIF (RL from AI Feedback) pipeline used in Constitutional AI. How does the model generate its own preference data?
Answer Sketch
Step 1: Generate multiple responses to each prompt. Step 2: For each response, ask the model to critique it against each constitutional principle. Step 3: Ask the model to revise the response based on the critique. Step 4: Use the original and revised responses as preference pairs (revised = chosen, original = rejected). Step 5: Train a reward model on these AI-generated preferences. Step 6: Run PPO using the AI preference reward model. The entire pipeline requires no human preference annotations.
Write a constitution (set of 5 to 7 principles) for a customer service chatbot. Each principle should be specific enough to guide self-critique but general enough to apply across many scenarios.
Answer Sketch
1. 'Always provide accurate information about our products and policies. If unsure, say so.' 2. 'Be empathetic to customer frustration without being condescending.' 3. 'Never share other customers' personal information.' 4. 'If you cannot resolve the issue, offer to escalate to a human agent.' 5. 'Keep responses concise (under 150 words) unless the customer asks for detail.' 6. 'Never make promises about refunds or compensation without checking policy.' 7. 'Use professional but friendly language; avoid jargon.'
Under what conditions does self-critique (asking the model to evaluate its own output) work well, and when does it fail? How does model size affect self-critique quality?
Answer Sketch
Works well when: (1) the model is large enough to reason about its own outputs (>70B parameters typically), (2) the critique criteria are clear and objective (e.g., 'contains factual errors'), (3) the error is something the model 'knows' is wrong but failed to avoid in the initial generation. Fails when: (1) the model lacks the knowledge to detect its own errors (blind spots), (2) critique criteria are subjective, (3) the model is too small to reliably evaluate complex outputs. Smaller models need external critics.
Compare Constitutional AI and traditional RLHF in terms of cost, scalability, transparency, and alignment quality. When would you prefer each approach?
Answer Sketch
CAI advantages: much cheaper (no human annotators), more scalable, transparent (principles are human-readable), easy to update (modify principles, regenerate data). RLHF advantages: captures subtle human preferences that are hard to articulate as rules, may produce more natural-feeling responses. Prefer CAI when: budget is limited, alignment requirements are clearly articulable, you need auditability. Prefer RLHF when: you have budget for annotators, the desired behavior is subjective or hard to formalize, and you need the highest quality alignment.
During RLHF, always include a KL divergence penalty between the policy and the reference model. Without it, the model quickly learns to exploit reward model weaknesses, producing high-reward but low-quality outputs.
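As a numerical sketch of what this penalty looks like (the reward, log-probabilities, and beta below are illustrative values, not from a real run):

```python
# Sequence-level KL-penalized reward, as used in PPO-based RLHF:
#   r_total = r_RM - beta * sum_t (log pi(y_t) - log pi_ref(y_t))
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    kl = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl

# The policy assigns higher probability to its own tokens than the
# reference does, so the KL estimate is positive and reward shrinks.
r = kl_penalized_reward(1.0, [-0.5, -0.7], [-1.2, -1.5], beta=0.1)
print(round(r, 2))  # → 0.85 (drift of 1.5 nats costs 0.15 reward)
```

Raising beta pulls the policy closer to the reference model; lowering it gives the policy more freedom to chase reward, at the cost of easier reward hacking.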
What Comes Next
In the next section, Section 17.4: RLVR: Reinforcement Learning with Verifiable Rewards, we cover RLVR, which replaces learned reward models with automated verification for domains like math and code where correctness can be checked programmatically.
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback.
The foundational CAI paper from Anthropic. Describes both the critique-and-revision pipeline and RLAIF training with AI-generated preferences. Defines the constitutional approach that replaces per-example human labels with declarative principles.
Lee, H., Phatale, S., Mansoor, H., et al. (2023). RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.
Google's systematic study comparing AI-generated preferences to human preferences for RL training. Shows RLAIF can match RLHF quality at significantly lower cost, validating the AI feedback approach at scale.
Sun, Z., Shen, Y., Zhou, Q., et al. (2023). Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. NeurIPS 2023.
Extends the constitutional approach by bootstrapping alignment from a small seed set of principles with minimal human involvement. Demonstrates that models can self-align effectively when given clear behavioral specifications.
Yuan, W., Pang, R. Y., Cho, K., et al. (2024). Self-Rewarding Language Models. ICML 2024.
Proposes models that iteratively improve by judging their own outputs and training on the resulting preferences. Creates a self-improvement loop where both generation and evaluation quality increase together.
Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2022). Discovering Latent Knowledge in Language Models Without Supervision.
Introduces Contrast-Consistent Search (CCS) for extracting truthful beliefs from LLM representations without labels. Relevant to understanding how models internally represent alignment-relevant concepts.
Ganguli, D., Lovitt, L., Kernion, J., et al. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.
Comprehensive study of red teaming methods for finding alignment failures. Covers both manual and automated red teaming approaches, with insights on how harmful behavior scales with model size.
