Section 20.7: Leading Video Models: Sora, Veo, Runway, Kling, and Pika

"In 2023 video AI was a curiosity. In 2024 it became a craft. In 2025 it shipped Hollywood. In 2026 we are arguing about union contracts."
Pixel, Sora-Watching AI Agent

Big Picture

The capability and accessibility of video generation have changed roughly twice as fast as image generation did at the same point in the technology's life. Sora's preview launch in February 2024 was the public turning point; Veo 2 (Google, December 2024), Pika 2.0 (late 2024), Kuaishou's Kling 2.0 (early 2025), Veo 3 (April 2025), Runway Gen-4 (June 2025), and Sora 2 (October 2025) followed in tight succession. This section walks the 2026 commercial frontier, builds a capability matrix across the dimensions that matter for production (length, fidelity, motion, character consistency, audio, control), surveys the open alternatives that approach but do not match the closed frontier, and discusses production cost and latency. The architectural backbone of every system on the chart is the video DiT from Section 20.6; the differences are in scale, data, training tricks, and the integration layer around the model.

Prerequisites

This section assumes the video DiT architecture from Section 20.6. Familiarity with the closed-versus-open trade-off from Section 10.6 helps you read the capability matrix. Platform-selection patterns are revisited in the tools-of-the-trade module at the end of this part.

20.7.1 The 2026 Frontier: Sora 2, Veo 3, Runway Gen-4

Three models hold the commercial frontier in mid-2026: OpenAI's Sora 2, Google's Veo 3, and Runway Gen-4. They differ less in absolute quality (all three produce broadcast-acceptable output) than in where their strengths lie.

Sora 2 (OpenAI, October 2025) shipped with native audio generation, character consistency across multi-shot sequences (the "cameos" feature lets a user define a character and reuse it across many shots), and 60-second-plus clips at 1080p. The audio is the most striking new capability: Sora 2 generates synchronized sound effects, ambient audio, and intelligible dialogue alongside the video in a single forward pass, eliminating the need for a separate Foley step. The character-consistency feature works by training the model to attend to a special "subject" token derived from a few reference frames; subsequent generations reuse the same token.

Veo 3 (Google DeepMind, April 2025) leads on visual fidelity and physical-world realism. The model is rumored to have been trained with extensive simulation-grounded data plus reinforcement signals from a perception model, giving it stronger handling of fluid dynamics, cloth, hair, and crowd behavior. Veo 3 also has the best handling of complex camera moves (a topic Section 20.8 covers in depth). It does not produce audio in the same forward pass; Google ships it alongside a separate Lyria-2-based music generator whose output the user composes in post.

Runway Gen-4 (June 2025) targets the professional filmmaker rather than the consumer prompt-user. Its interface treats video generation as part of a non-linear editor: you upload reference images, character sheets, and storyboard frames, and the model fills in shots that match the reference style and characters. Runway's strength is the integration layer, not the raw model quality (Sora 2 and Veo 3 produce somewhat better individual shots in head-to-head blind tests); for a working production team, the difference between "best model in a notebook" and "best workflow in a film tool" matters more than the difference between FVD scores.

20.7.2 The Second Tier: Kling, Pika, Luma Dream Machine

A second tier of models occupies roughly the $3-18 per minute price band (Table 20.7.1) and produces 80-90% of the quality of the frontier at a fraction of the cost. Kling 2.0 (Kuaishou, March 2025) is the dominant model in China; it produces excellent character motion and was the first commercial system to ship physically grounded long-form video. Pika 2.0 (Pika Labs, late 2024) offers a strong consumer creator product with a friendly UI and good handling of stylized animation (anime, cartoons, motion-graphics styles). Luma Dream Machine (Luma AI, mid-2024) introduced the camera-control workflow that Veo 3 later improved on; Luma's product is now more of a specialist tool for camera-driven shots than a general-purpose video generator.

These second-tier models are often the right choice for production teams that need predictable throughput, stable pricing, and acceptable quality for B-roll, social-media content, or stylized animation. The first tier (Sora 2, Veo 3, Gen-4) is the right choice when shot quality is the hero of the project and the budget can support $50-200 per minute of finished content (more than the raw per-second prices in Table 20.7.1 imply, since most shots take several generations to get right).

Key Insight

Quality is now adequate; control is the differentiator

The 2024 era of video AI was about getting a coherent 4-second clip out at all. The 2026 era is about controllability: can you specify the camera path, can you maintain a character across cuts, can you reproduce a specific shot style, can you make characters lip-sync to provided dialogue, can you edit a generated shot without regenerating from scratch. Sora 2's cameos, Veo 3's camera control, Runway Gen-4's reference-driven workflow, and Kling 2.0's motion-brush are all attempts to expose the same underlying capability (which the DiT already has) through different control interfaces. The model architectures across these systems are converging; the differentiation is in the surface area exposed to the user.

20.7.3 Capability Matrix

Model	Max Length	Max Resolution	Audio Out	Character Consistency	Camera Control	Price (est., per second of output)
Sora 2 (Pro)	~2 min	1080p	Native (sync)	Cameos (multi-shot)	Prompt-based	$0.30-1.00/sec
Veo 3 (Ultra)	~3 min	1080p (4K extension)	No (separate Lyria)	Subject refs	Path-conditioned	$0.40-1.20/sec
Runway Gen-4	~30 s	1080p	No	Strong (refs + sheet)	Strong (motion brush, drag)	$0.20-0.80/sec
Kling 2.0	~2 min	1080p	Beta (audio v1)	Subject refs	Motion brush	$0.10-0.30/sec
Pika 2.0	~10 s loop / 60 s scene	1080p	No	Subject refs	Camera presets	$0.05-0.20/sec
Luma Dream Machine 1.6	~10 s	1080p	No	Subject refs	Camera path UI	$0.10-0.30/sec
Sora Turbo (free tier)	20 s	720p	No	Limited	Prompt-based	Bundled w/ ChatGPT
HunyuanVideo (open)	~5 s	720p	No	Limited	Prompt-based	Self-host (~$0.05/sec)
CogVideoX (open)	~6 s	720p	No	Limited	Prompt-based	Self-host (~$0.04/sec)
Open-Sora 1.2 (open)	~16 s	720p	No	Limited	Prompt-based	Self-host (~$0.06/sec)

Table 20.7.1: 2026 video-generation capability matrix. Prices are author-aggregated rough estimates that depend heavily on tier and bulk pricing; self-host costs assume on-demand A100 rental at $1.50/hr.
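The per-second prices in Table 20.7.1 are easier to compare against a budget once converted to per-minute figures. The sketch below does that conversion and also backs out how many GPU-seconds per second of output a self-host estimate implies at the stated $1.50/hr A100 rate. The `takes_per_keeper` parameter is a hypothetical knob, not a number from the table: it models generating several takes for each shot that is kept.

```python
# Price ranges (USD per second of output) copied from Table 20.7.1.
PRICE_PER_SEC = {
    "Sora 2 (Pro)": (0.30, 1.00),
    "Veo 3 (Ultra)": (0.40, 1.20),
    "Runway Gen-4": (0.20, 0.80),
    "Kling 2.0": (0.10, 0.30),
    "Pika 2.0": (0.05, 0.20),
}

def per_minute(price_range, takes_per_keeper=1):
    """Cost range per finished minute, given takes generated per kept shot."""
    low, high = price_range
    return (low * 60 * takes_per_keeper, high * 60 * takes_per_keeper)

def implied_gpu_seconds(cost_per_output_sec, gpu_hourly_rate=1.50):
    """GPU-seconds per output second implied by a self-host cost estimate."""
    return cost_per_output_sec / (gpu_hourly_rate / 3600)

for model, price_range in PRICE_PER_SEC.items():
    low, high = per_minute(price_range)
    print(f"{model:14s} ${low:6.2f} - ${high:6.2f} per minute of raw output")

# HunyuanVideo's ~$0.05/sec self-host estimate at $1.50/hr:
print(round(implied_gpu_seconds(0.05)), "GPU-seconds per output second")
```

With `takes_per_keeper=4`, the Sora 2 (Pro) range becomes $72 to $240 per finished minute, which is one way a per-second list price can grow into a much larger per-finished-minute budget.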

20.7.4 Evaluation: FVD, VBench, and Human Preference

Video generation has no universally accepted single metric of the kind image generation has in FID and CLIP score. The two automatic metrics most often cited are FVD (Fréchet Video Distance, adapted from FID by computing feature statistics on I3D, that is Inflated 3D ConvNet, features over short clips; lower is better) and VBench (Huang et al., 2024), a battery of 16 quality dimensions including subject consistency, background consistency, motion smoothness, dynamic degree, and aesthetic quality. Both correlate with human preference but neither is a perfect proxy: a model with a better FVD can still produce shots that humans clearly rank below those of a model with a worse FVD.
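FVD reuses the Fréchet distance from FID: fit a Gaussian to each set of clip features, then compare the means and covariances. A minimal NumPy sketch of that distance follows. It assumes the features have already been extracted by a video network; random vectors stand in for real features here, and `frechet_distance` is an illustrative name rather than a function from any particular library.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_clips, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in
    # exact arithmetic; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 16))     # stand-in for real-video features
close = rng.normal(0.0, 1.0, size=(2000, 16))    # drawn from the same distribution
shifted = rng.normal(0.5, 1.0, size=(2000, 16))  # mean shifted in every dimension

print(frechet_distance(real, close))    # small: the distributions match
print(frechet_distance(real, shifted))  # larger: dominated by the mean shift
```

Because the distance only sees first- and second-order feature statistics pooled over many clips, a failure confined to a few frames of a few clips can leave it almost unchanged.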

In practice, frontier-comparison work in 2025-2026 has converged on arena-style human preference evaluations: paired blind comparisons where humans pick the better of two videos given a shared prompt, aggregated across thousands of prompts. The video equivalents of Chatbot Arena are the Hailuo Arena (Minimax), the LMSys Video Arena, and Artificial Analysis's video leaderboard. These rankings shift month by month as new models ship. As of mid-2026 the overall arena order is roughly Sora 2, Veo 3, Kling 2.0, Runway Gen-4, Pika 2.0, with sub-rankings that vary substantially by category (motion-heavy, character-driven, abstract).
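Arena leaderboards turn pairwise votes into a ranking. The text does not say which aggregation any particular arena uses, so the sketch below applies a plain Elo update as one common choice. The votes are synthetic and the model names are placeholders.

```python
from collections import defaultdict

def elo_ratings(votes, k=16.0, base=1000.0):
    """Compute Elo ratings from (winner, loser) pairs processed in order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected win probability of the winner under the current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)   # bigger reward for an unexpected win
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Synthetic votes: model_a usually beats model_b, model_b usually beats model_c.
votes = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)

ratings = elo_ratings(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.0f}")
```

Elo results depend on the order in which votes arrive, so a Bradley-Terry fit over all votes at once is a common alternative when a stable leaderboard is the goal.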

# Generate a video with Runway Gen-4 via the official Python SDK
# pip install runwayml
import os
import time
from runwayml import RunwayML

client = RunwayML(api_key=os.environ["RUNWAYML_API_KEY"])

# Gen-4 supports image-to-video, text-to-video, and reference-driven generation.
# Here we use a reference image as the first frame plus a motion prompt.
task = client.image_to_video.create(
    model="gen4_turbo",
    prompt_image="https://example.com/reference_frame.jpg",
    prompt_text=(
        "the character turns their head slowly to look at the camera, "
        "warm golden-hour lighting, shallow depth of field, cinematic"
    ),
    duration=10,                  # seconds; check the API reference for allowed values
    ratio="1280:720",             # 16:9 given as a pixel ratio; check the API reference
    seed=42,
)
print(f"Task {task.id} queued")

# Poll until the task reaches a terminal state or the deadline passes.
# Gen-4 Turbo typically returns in 2-4 minutes for a 10 s clip.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}
deadline = time.monotonic() + 15 * 60   # give up after 15 minutes
while True:
    status = client.tasks.retrieve(task.id)
    # progress may be missing before the task starts running; show 0.0 then
    progress = getattr(status, "progress", None) or 0.0
    print(f"  state={status.status} progress={progress}")
    if status.status in TERMINAL_STATES:
        break
    if time.monotonic() > deadline:
        raise TimeoutError(f"Task {task.id} did not finish in time")
    time.sleep(15)

if status.status == "SUCCEEDED":
    print(f"Output URL: {status.output[0]}")
else:
    print(f"Task ended with status {status.status}")

Output:
Task tsk_AB12CD34 queued
  state=PENDING progress=0.0
  state=RUNNING progress=0.31
  state=RUNNING progress=0.78
  state=SUCCEEDED progress=1.0
Output URL: https://runway.cdn.../gen4_turbo_AB12CD34.mp4

Code Fragment 20.7.1a: Generating a 10-second image-conditioned video with Runway Gen-4 Turbo. The pattern (submit a task, poll, retrieve URL) is identical across the major commercial vendors; only the parameter names differ. Plan for 1-5 minute latencies and design your pipeline to be async rather than synchronous.

20.7.5 Open Alternatives: CogVideoX, HunyuanVideo, Wan 2.1, Open-Sora

The open-weight video ecosystem in 2026 has four serious players. CogVideoX-5B remains the easiest model to fine-tune; the community has produced hundreds of LoRAs and full fine-tunes on Hugging Face. HunyuanVideo-13B (Tencent, December 2024) is the highest-quality open weight, competitive with Kling 2.0 on standard benchmarks; it requires a multi-GPU setup (4x A100 minimum for inference). Wan 2.1 (Alibaba, March 2025) introduced an explicit causal VAE that supports streaming generation and was the first open model to ship credible long-form clips (60-second cap). Open-Sora 2.0 (PKU-YuanGroup, late 2025) is the largest open Sora-style implementation, with 13B parameters and an open dataset of around 80k hours.

These four serve different community roles. CogVideoX is the teaching model and the fine-tune base. HunyuanVideo is the quality reference if you have the hardware. Wan 2.1 is the streaming-and-long-form reference. Open-Sora is the recipe-transparent reference. None matches the closed frontier on out-of-the-box quality, but all four are good enough for product use cases where licensing flexibility, on-premise deployment, or domain-specific fine-tuning matters more than the absolute quality ceiling.

Warning: Open-source license terms vary wildly

"Open weight" is not "open source" and is not "Apache 2.0". The CogVideoX code and 2B weights use the Apache 2.0 license, while the 5B weights ship under a separate CogVideoX license. HunyuanVideo uses a Tencent custom license with commercial-use restrictions for entities with over 100M monthly active users. Wan 2.1 is released under Apache 2.0. Open-Sora 2.0 is permissive as well, but its training data includes scraped content that may carry inherited rights. Always read the actual license before deploying any open video model in a commercial product; "open" does not mean "no obligations."

20.7.6 Production Cost, Latency, and Throughput

Video generation is the most expensive modality to serve in production. A 10-second 1080p Sora 2 generation consumes roughly 100-200 GPU-seconds on an H100, or 10-20 GPU-seconds per second of output; pricing it at $0.30-$1.00 per second of output reflects the hardware cost plus margin. For comparison, a single image from Stable Diffusion 3 is about 0.5 GPU-seconds, and a 30-second piece of audio from MusicGen is about 5 GPU-seconds (under 0.2 GPU-seconds per second of output); video is 50-100x more expensive per second of output.
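The ratio quoted above can be checked by normalizing each figure to GPU-seconds per second of output. The numbers below are the rough estimates from the paragraph. Images have no duration, so the sketch compares video with audio per output second and compares a whole clip with a single image.

```python
# Rough figures from the paragraph above.
video_gpu_s = (100.0, 200.0)   # GPU-seconds for one 10-second clip (low, high)
video_len_s = 10.0
audio_gpu_s = 5.0              # GPU-seconds for one 30-second track
audio_len_s = 30.0
image_gpu_s = 0.5              # GPU-seconds for one image

# Normalize to GPU-seconds per second of output.
video_per_out_s = tuple(g / video_len_s for g in video_gpu_s)
audio_per_out_s = audio_gpu_s / audio_len_s

ratio_vs_audio = tuple(v / audio_per_out_s for v in video_per_out_s)
clip_vs_image = tuple(g / image_gpu_s for g in video_gpu_s)

print(f"video: {video_per_out_s[0]:.0f}-{video_per_out_s[1]:.0f} GPU-s per output second")
print(f"audio: {audio_per_out_s:.2f} GPU-s per output second")
print(f"video vs audio, per output second: {ratio_vs_audio[0]:.0f}x-{ratio_vs_audio[1]:.0f}x")
print(f"one 10 s clip vs one image: {clip_vs_image[0]:.0f}x-{clip_vs_image[1]:.0f}x")
```

Under these figures the per-output-second gap against audio comes out at about 60x to 120x, the same order as the 50-100x quoted, while a full 10-second clip costs a few hundred times as much as one image.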

Throughput is similarly constrained. A frontier model like Sora 2 takes 1-5 minutes of wall-clock time per generation, so a production team shipping daily content cannot build on synchronous calls and has to design async batch workflows around the latency. The product implication is that video AI is currently a "send your request, get an email when it is ready" UX rather than the "type a prompt and see it in 3 seconds" UX of image generation. This will change as inference efficiency improves, but the 50-100x compute gap relative to images is structural and is unlikely to disappear.
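One way to structure such an asynchronous batch workflow is to submit every shot up front and poll the tasks concurrently. The sketch below uses `asyncio` with a simulated backend: `FakeVideoAPI`, `submit`, and `poll` are hypothetical stand-ins, not any vendor's SDK, and the sleep times are shortened so the example finishes in about a second.

```python
import asyncio
import random

class FakeVideoAPI:
    """Simulated vendor backend (hypothetical); real jobs take minutes."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._remaining = {}   # task_id -> polls left before completion

    async def submit(self, prompt):
        task_id = f"task_{len(self._remaining)}"
        self._remaining[task_id] = self._rng.randint(1, 4)
        return task_id

    async def poll(self, task_id):
        self._remaining[task_id] -= 1
        done = self._remaining[task_id] <= 0
        return {"status": "SUCCEEDED" if done else "RUNNING",
                "url": f"https://example.com/{task_id}.mp4" if done else None}

async def generate(api, prompt, poll_interval=0.1, timeout=5.0):
    """Submit one prompt and wait for its result without blocking others."""
    task_id = await api.submit(prompt)

    async def wait_until_done():
        while True:
            result = await api.poll(task_id)
            if result["status"] == "SUCCEEDED":
                return result["url"]
            await asyncio.sleep(poll_interval)

    # Fail loudly instead of waiting forever on a stuck task.
    return await asyncio.wait_for(wait_until_done(), timeout=timeout)

async def generate_batch(api, prompts, max_in_flight=3):
    """Run many generations concurrently, capped by a semaphore."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt):
        async with gate:
            return await generate(api, prompt)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))

shots = [f"shot {i}: storyboard description" for i in range(5)]
urls = asyncio.run(generate_batch(FakeVideoAPI(), shots))
for shot, url in zip(shots, urls):
    print(shot, "->", url)
```

The semaphore stands in for a vendor rate limit or a spending cap; in a real pipeline the results would more likely be written to a queue or database than printed.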

Real-World Scenario: Streaming Service Promo Trailer

A streaming service producing weekly 30-second promo trailers for new shows assembled its 2025 workflow as: (1) a writer drafts a 5-shot storyboard with scene descriptions and character references, (2) the team uses Runway Gen-4 to produce all 5 shots conditioned on the show's actual cast reference images (consented and licensed), (3) Veo 3 is used to upscale and stabilize the best shots, (4) Suno V5 generates the music bed, (5) ElevenLabs handles the voice-over read by a cloned (consented) celebrity voice, (6) a human editor cuts the trailer together in DaVinci Resolve. Total compute cost is about $400 per trailer. The comparable traditional shoot would be $50,000 to $250,000. The streaming service does not replace the high-end trailers with AI; it uses AI for the long tail of mid-tier shows that previously did not justify a custom trailer at all.

20.7.7 The Arms Race and Its Limits

The pace of capability improvement in video AI shows no sign of slowing, but several structural limits will shape the next 18 months. Training compute is the most visible: Sora 2's reported training cost is in the high tens of millions of dollars, and a hypothetical Sora 3 at 10x scale would approach a billion-dollar training run that only a handful of companies, such as Google, OpenAI, and Microsoft, can finance. Training data is the second: high-quality video with clean captions is scarce, and the synthetic-data trick that worked for image and text generation (use a strong model to label data for the next-generation model) is harder for video because video labels are richer (camera path, character identity, motion description) and synthetic labeling errors compound.

The third limit is product. Even the best 2026 model produces shots that need human selection and editing; the prompt-to-finished-video pipeline that some 2024 commentators predicted has not materialized. The 2026 production workflow is human creative direction plus AI shot generation plus human editing, and the editing-and-curation step has not been automated away. The arms race in capability has not yet translated into elimination of the human role; it has translated into different humans doing different work at different scales.

Fun Fact: The "10-finger problem" for video

Image generation has the classic "10-finger problem": diffusion models historically struggled to produce hands with the correct count of fingers. The video equivalent is what practitioners call the "object permanence problem": objects that briefly leave the frame often come back with subtly different geometry, color, or identity. Sora 2 and Veo 3 are the first generation that handle short occlusions reliably; longer occlusions (an object hidden for 5+ seconds and then revealed) are still unreliable. The fix appears to be longer-context attention plus character-token-like persistence tricks; the bug will likely be largely resolved by 2027.

Key Insight

The 2026 frontier of video generation is three closed models (Sora 2, Veo 3, Runway Gen-4) at $0.20-$1.20 per second of output, three solid second-tier products (Kling 2.0, Pika 2.0, Luma Dream Machine) at $0.05-$0.30 per second, and four open weights (CogVideoX, HunyuanVideo, Wan 2.1, Open-Sora 2.0) that approach but do not match the frontier. The capability matrix that matters in production is length, fidelity, motion, character consistency, audio, and control; no single model wins on all six axes, and most production teams use 2-3 models in a single pipeline. Cost per second is 50-100x higher than image generation; the latency is minutes rather than seconds; the integration pattern is asynchronous task submission.

Self-Check

Q1: The capability matrix in Table 20.7.1 shows that Sora 2 has native audio while Veo 3 does not. Sketch the architectural difference (one DiT vs. DiT-plus-separate-audio-model) and explain which approach is more likely to produce well-synchronized output and which is more likely to scale to longer durations.

Show Answer

Sora 2 packs both video and audio tokens into a single DiT sequence and lets full self-attention learn lip-sync, footsteps, ambient room tone, and on-screen sound effects jointly with the visual content. Veo 3 generates video alone, and audio from a separate Lyria-2-based model is composed with it in post. The unified architecture is the one more likely to produce well-synchronized output, because lip-sync and Foley share the same attention layers as the pixels they correspond to. The two-model architecture is more likely to scale to longer durations, because audio tokens stay out of the video DiT's quadratic-attention budget, leaving the per-step token count free to grow with duration.

Q2: FVD scores correlate imperfectly with human preference. List three failure modes (where a model has good FVD but humans prefer a different model) and design an automatic metric that targets one of them.

Show Answer

Three failure modes are (1) a model that hits realistic per-frame texture statistics but flickers identity (object permanence breaks), so FVD looks fine but humans flag the discontinuity; (2) a model that nails motion smoothness while ignoring the prompt's specified action (a good FVD, since FVD never sees the prompt, but low semantic adherence); (3) a model that produces realistic but boring shots with little motion, scoring well on I3D feature statistics that reward "natural" footage but losing arena ratings. A targeted metric for (1) is "subject-token cosine consistency": run a vision encoder on the dominant subject crop in every frame, compute pairwise cosine similarity across the clip, and flag clips whose minimum pairwise similarity falls below a threshold. This isolates the permanence break that FVD averages away.
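The consistency metric described in this answer can be sketched in a few lines. The vision encoder and the subject cropping are out of scope here, so the example assumes per-frame subject embeddings are already available and uses synthetic vectors. The function names and the 0.8 threshold are illustrative assumptions, not values from a benchmark.

```python
import numpy as np

def min_pairwise_cosine(frame_embeddings):
    """Minimum cosine similarity over all pairs of per-frame embeddings.

    frame_embeddings: array of shape (num_frames, dim), one row per frame.
    """
    norms = np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    unit = frame_embeddings / norms
    sims = unit @ unit.T               # (num_frames, num_frames) cosine matrix
    return float(sims.min())

def flags_permanence_break(frame_embeddings, threshold=0.8):
    """True when at least one pair of frames disagrees about the subject."""
    return min_pairwise_cosine(frame_embeddings) < threshold

rng = np.random.default_rng(0)
identity = rng.normal(size=64)

# Stable clip: the same identity vector plus small per-frame noise.
stable = identity + 0.05 * rng.normal(size=(24, 64))

# Broken clip: halfway through, the subject switches to a different identity.
other = rng.normal(size=64)
broken = stable.copy()
broken[12:] = other + 0.05 * rng.normal(size=(12, 64))

print(flags_permanence_break(stable))   # False
print(flags_permanence_break(broken))   # True
```

Taking the minimum rather than the mean is the design choice that matters: a single identity switch is enough to trip the flag, whereas an average over all frame pairs would dilute it.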

Q3: The cost gap between video generation and image generation is 50-100x per second of output. Identify the two largest contributors (tokens-per-output, denoising-steps-per-output, or model-parameters) to that gap and forecast how they will evolve over the next two years.

Show Answer

The two largest contributors are tokens-per-output (a 5-second 1080p generation has roughly 30 to 80 times more DiT tokens than a single 512x512 image, and self-attention is quadratic in this count) and model-parameters (frontier video DiTs are 5 to 13B versus 2 to 8B for the strongest image DiTs). Denoising-step count is roughly equal once both modalities adopt rectified flow with 20 to 50 steps. Over the next two years, sparse and hierarchical attention plus state-space hybrids should compress the token cost by an estimated 3 to 5x, while parameter counts may rise rather than fall, so the realistic forecast is that the gap closes from roughly 75x to roughly 20 to 30x, not to parity.

In the next section, Section 20.8: Camera Control, Motion Control, and ControlNet for Video, we continue.

What Comes Next

Section 20.8 dives into the control layer: how do you tell a video model to do a specific camera move, hold a specific character constant, or follow a specific motion path? Camera control, motion brush, and ControlNet-for-video are the techniques that let frontier video models produce intentional rather than just plausible output.

Further Reading

OpenAI (2025). Sora 2 System Card and Technical Notes. openai.com/index/sora-2-system-card. The product technical brief documenting cameos, audio, and safety mitigations.
Google DeepMind (2025). Veo 3: Highest-Quality Video Generation. deepmind.google/technologies/veo. The Veo 3 product page; documents fidelity and physics improvements.
Runway (2025). Gen-4: Reference-Driven Video Generation for Professionals. runwayml.com/research/gen-4. The professional-tool-oriented frontier model; reference-and-character-sheet workflows.
Kuaishou (2025). Kling 2.0 Technical Report. kling.kuaishou.com. The China-frontier model and its physics-aware long-form video.
Huang, Z. et al. (2024). VBench: Comprehensive Benchmark Suite for Video Generative Models. CVPR 2024. arXiv:2311.17982. The 16-dimensional automatic-eval benchmark used in most 2024-2025 academic comparisons.
Artificial Analysis (2025). Video Generation Leaderboard. artificialanalysis.ai/text-to-video. Continuously updated arena-style human-preference leaderboard.
Wan, X. et al. (2025). Wan 2.1: Open Foundation Models for Video Generation. arXiv:2503.20314. Alibaba's open long-form-capable video DiT; introduced the streaming-friendly causal VAE.
Reuters (2025). Hollywood Unions, AI Video, and the 2024-2026 Contract Renegotiations. reuters.com. Industry coverage of the labor side of frontier video AI deployment.