Camera Control, Motion Control, and ControlNet for Video

Section 20.8

"Anyone can prompt a model to make a video. The craft is making it move the way you imagined."

PixelPixel, Camera-Controlling AI Agent

Big Picture

A frontier video model from Section 33.2 produces a plausible shot from a prompt. The same model in the hands of a craftsperson produces the shot they had in mind. The bridge between the two is the control layer: camera path conditioning (dolly, pan, orbit, crane moves), motion brush (mask out a region and indicate the direction it should move), drag-based control (click and drag to specify how an object travels through the frame), pose and skeleton conditioning (animate a character to match a reference motion), and ControlNet-style multi-channel guidance (depth, edge, and segmentation maps that anchor the generation). This section walks through each control modality, shows how each plugs into the underlying video DiT, and demonstrates AnimateDiff as the open recipe most accessible for experimentation.

Prerequisites

This section assumes the video DiT internals from Section 20.6 and the ControlNet image-conditioning pattern from Section 19.10. The camera-pose vocabulary is revisited in the 3D-generation chapter later in this part.

20.8.1 The Control Problem and the Spectrum of Solutions

A pure text-to-video model has one input (the text prompt) and emits an uncontrolled distribution of outputs. For most production purposes this is too loose: the cinematographer wants this specific camera move, the animator wants this specific motion, the editor wants this specific timing. Adding controls to a video DiT is an example of the broader pattern from image generation (ControlNet, IP-Adapter, T2I-Adapter): inject additional conditioning signals at training time so the model learns to respect them, then expose those signals at inference time as user-controllable inputs.

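The spectrum of video controls spans, roughly in order of increasing specificity: the prompt alone, camera path, motion brush, drag points, pose skeleton, and dense per-frame channels (depth, Canny edges, segmentation).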

Production systems compose these. Veo 3 ships camera-path control natively; Runway Gen-4 ships motion brush plus drag-based control plus reference-driven character consistency; AnimateDiff (open) ships pose conditioning and ControlNet channels via plug-in motion modules. Choosing the right control modality depends on what aspect of the shot you want to lock down.

Key Insight: Control is conditioning, conditioning is tokens

Every video-control modality at the architectural level is the same thing: an additional set of input tokens (or an additional cross-attention stream) that the DiT attends to during denoising. Camera path becomes a sequence of 6-DoF tokens per frame. Motion brush becomes a 2D vector field per frame. Pose becomes a skeleton-token sequence. ControlNet channels become an extra image-channel stream per frame. The architecture does not change; the conditioning vocabulary grows. This is the same trick that turned image ControlNet into the universal control framework for diffusion models in 2023, applied one dimension up.
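The "conditioning is tokens" idea can be made concrete with a toy sketch. The NumPy snippet below is illustrative only: the dimensions, the `project` helper, and the random linear embeddings are assumptions made for the example, not the layout of any particular video DiT. It shows a camera path and a skeleton sequence each being embedded to the model width and appended to the video token sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D = 16, 64            # frames and model width (illustrative values)
tokens_per_frame = 4     # tiny patch grid so the example stays readable

# Video latent tokens: one row per (frame, patch).
video_tokens = rng.normal(size=(F * tokens_per_frame, D))

# Control signals in their native shapes.
camera_poses = rng.normal(size=(F, 6))      # 6-DoF pose per frame
skeleton = rng.normal(size=(F, 18, 2))      # 18 2D keypoints per frame

def project(x, out_dim, rng):
    """Hypothetical embedding: flatten per-frame features, apply a random linear map."""
    flat = x.reshape(x.shape[0], -1)
    w = rng.normal(size=(flat.shape[1], out_dim)) / np.sqrt(flat.shape[1])
    return flat @ w

camera_tokens = project(camera_poses, D, rng)   # (F, D)
pose_tokens = project(skeleton, D, rng)         # (F, D)

# The denoiser attends over one longer sequence; its blocks are unchanged.
sequence = np.concatenate([video_tokens, camera_tokens, pose_tokens], axis=0)
print(sequence.shape)  # (96, 64)
```

Adding another modality in this picture means adding another `project` call and another block of rows, which is the sense in which the conditioning vocabulary grows while the architecture stays put.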

20.8.2 Camera Path Conditioning

Camera path conditioning is the most useful control modality for narrative cinematography. The user specifies a camera trajectory (a sequence of 6-DoF poses: position + orientation per frame, or a named move like "slow dolly in", "orbit clockwise", "crane up"), and the model generates a video whose implicit camera matches that path. The first credible implementation was MotionCtrl (Wang et al., 2023, arXiv:2312.03641), a research adapter that retrofitted camera-path conditioning onto pretrained video diffusion models such as Stable Video Diffusion. CameraCtrl (He et al., 2024, arXiv:2404.02101) followed with cleaner training and stronger generalization. By 2025 the major commercial systems shipped native camera control: Luma Dream Machine, Pika, Kling, and Veo 3 all expose 6-DoF camera input either as a parameter to the API or as an interactive 3D widget in the UI.
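To make "a named move becomes a sequence of 6-DoF poses" concrete, here is a minimal sketch. The function names, the `(x, y, z, yaw, pitch, roll)` tuple layout, and the coordinate convention (subject at the origin, camera looking down the negative z axis at yaw 0) are assumptions chosen for the example; real systems each define their own conventions.

```python
import math

def dolly_in(num_frames, start_z=4.0, end_z=2.0):
    """Camera slides along z toward a subject at the origin, with no rotation."""
    poses = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)
        z = start_z + t * (end_z - start_z)
        poses.append((0.0, 0.0, z, 0.0, 0.0, 0.0))  # x, y, z, yaw, pitch, roll
    return poses

def orbit(num_frames, radius=3.0, sweep_deg=90.0):
    """Camera circles the origin in the x-z plane and yaws to keep facing it."""
    poses = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)
        angle = math.radians(sweep_deg) * t
        x = radius * math.sin(angle)
        z = radius * math.cos(angle)
        poses.append((x, 0.0, z, angle, 0.0, 0.0))  # yaw equals the orbit angle
    return poses

dolly = dolly_in(5)
arc = orbit(16)
print([p[2] for p in dolly])  # z positions: [4.0, 3.5, 3.0, 2.5, 2.0]
```

A multi-segment path is then just a concatenation of such lists, which also hints at why seams are hard: velocity is discontinuous where one segment ends and the next begins unless the segments are blended.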

The training trick that makes camera control work is to extract camera trajectories from video training data using off-the-shelf SLAM or structure-from-motion (COLMAP, DROID-SLAM), then condition the DiT on the extracted camera path as an additional input. At inference, the user specifies the desired path and the model generates a video that respects it. Quality is good for short clips with simple moves; complex multi-segment paths (a dolly-in followed by a pan followed by a crane move) still have artifacts at the seams between segments.
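One preprocessing step that typically sits between pose extraction and conditioning can be sketched as follows. This is an assumption about a reasonable pipeline, not something the methods above are documented here to do: SfM tools report poses in an arbitrary world frame, so the sketch re-expresses every camera relative to the first frame. The toy `c2w` array stands in for real COLMAP or DROID-SLAM output.

```python
import numpy as np

def relative_to_first(c2w):
    """c2w: (F, 4, 4) camera-to-world matrices. Returns poses in frame-0 coordinates."""
    first_inv = np.linalg.inv(c2w[0])
    return np.einsum("ij,fjk->fik", first_inv, c2w)

# Toy stand-in for SfM output: a camera translating along x, far from the world origin.
F = 4
c2w = np.tile(np.eye(4), (F, 1, 1))
c2w[:, 0, 3] = np.array([5.0, 5.5, 6.0, 6.5])

rel = relative_to_first(c2w)

# Flatten the top 3x4 block of each pose into a 12-value per-frame conditioning vector.
camera_condition = rel[:, :3, :].reshape(F, 12)
print(camera_condition.shape)  # (4, 12)
```

After this step the first frame is always the identity pose, so two clips with the same move produce the same conditioning no matter where the reconstruction placed its origin.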

20.8.3 Motion Brush and Region Direction Arrows

Motion brush is the most intuitive control modality for non-technical users. The user paints a region of the first frame (the "brush") and draws an arrow indicating how that region should move over the duration of the clip. The most common implementations support multiple regions (e.g., "this person walks left, this car drives right, the background stays still") and intensity control (gentle motion vs. fast motion).

The architectural implementation conditions the DiT on a per-frame 2D vector field plus per-region masks. The vector field encodes the desired motion direction and magnitude at each pixel; the masks indicate which regions are under user control and which should follow the prompt's natural motion. Kling introduced motion brush as a core feature in 2024; Pika 2.0 followed with a UI variation called Pikaffects. Both are particularly well-suited to stylized animation work (anime, 2D motion graphics) where the motion must be precisely controlled.
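The brush-to-conditioning step can be sketched in a few lines of NumPy. The helper name `brush_to_field`, the array layout, and the choice to spread each arrow evenly over the clip are assumptions for illustration; the masks are also kept fixed at their first-frame position, which a real system would likely refine.

```python
import numpy as np

def brush_to_field(masks, arrows, num_frames, height, width):
    """masks: list of (H, W) bool arrays painted on the first frame.
    arrows: list of (dx, dy) total displacements in pixels over the clip.
    Returns a (F, H, W, 2) per-frame displacement field and a (H, W) control mask."""
    flow = np.zeros((num_frames, height, width, 2), dtype=np.float32)
    controlled = np.zeros((height, width), dtype=bool)
    for mask, (dx, dy) in zip(masks, arrows):
        per_frame = np.array([dx, dy], dtype=np.float32) / num_frames
        flow[:, mask] = per_frame      # same vector at every brushed pixel
        controlled |= mask
    return flow, controlled

H, W, F = 8, 8, 16
person = np.zeros((H, W), dtype=bool)
person[2:6, 0:3] = True                # "this person walks left"
car = np.zeros((H, W), dtype=bool)
car[6:8, 4:8] = True                   # "this car drives right"

flow, controlled = brush_to_field([person, car], [(-16.0, 0.0), (32.0, 0.0)], F, H, W)
print(flow.shape, int(controlled.sum()))  # (16, 8, 8, 2) 20
```

Arrow length doubles as intensity control here: the car's arrow is twice as long, so its per-frame displacement is twice as large. Pixels outside `controlled` carry a zero vector and are left to the prompt.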

20.8.4 Drag-Based Control: DragNUWA and Its Descendants

Drag-based control gives the user point-level precision: click on a pixel in the first frame, drag to where that pixel should end up at the end of the clip, and the model produces a video where that point travels along the specified trajectory. DragNUWA (Yin et al., 2023, arXiv:2308.08089) was the first widely-known implementation; DragAnything (Wu et al., 2024) extended the approach to handle multiple drag points with strict trajectory adherence. Runway Gen-4's "Motion Vector" tool is the polished commercial version of this idea.

The control is precise but fragile. For simple object translations (a ball moving across a table, a character's hand reaching to grab a cup) it works well. For complex transformations (a face shape-shifting, a character's full body following a precise martial arts move) the model often produces artifacts because the drag points underspecify the desired motion. In production, drag control is most useful for hero shots with one or two key points that have to land precisely, with the rest of the motion handled by prompt or motion brush.
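A single drag can be turned into a per-frame conditioning signal roughly as follows. This is a hedged sketch: straight-line interpolation and Gaussian heatmaps are one plausible encoding, and the helper names are invented for the example rather than taken from DragNUWA or DragAnything.

```python
import numpy as np

def drag_trajectory(start, end, num_frames):
    """Linearly interpolate an (x, y) point from start to end over the clip."""
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1 - t) * np.asarray(start, dtype=float) + t * np.asarray(end, dtype=float)

def rasterize(points, height, width, sigma=2.0):
    """Render one Gaussian blob per frame, centered on that frame's point."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)) for x, y in points]
    return np.stack(maps)               # (F, H, W)

traj = drag_trajectory(start=(4, 10), end=(28, 10), num_frames=16)
heat = rasterize(traj, height=32, width=32)
print(traj[0], traj[-1], heat.shape)
```

The sketch also shows why drag underspecifies complex motion: one point yields one blob per frame, and everything outside it (rotation, deformation, what the rest of the object does) is left for the model to guess.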

20.8.5 AnimateDiff and the Open Control Recipe

AnimateDiff (Guo et al., 2023, arXiv:2307.04725) is the canonical open recipe for adding motion to image diffusion models, and remains the most accessible playground for experimenting with video control in 2026. The core idea is to take a frozen image diffusion model (Stable Diffusion 1.5, SDXL, or SD3) and add a small motion module (a set of temporal-attention layers) that turns the still-image model into a video generator. The motion module is trained on video data while the image backbone is kept frozen, which lets AnimateDiff inherit all of the ControlNet, IP-Adapter, and LoRA ecosystem of the underlying image model.
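The key reshaping trick in a temporal-attention motion module can be sketched without any learned weights. The NumPy function below is a simplification under stated assumptions: it uses identity query/key/value projections, and the `out_scale` parameter is a stand-in for the zero-initialized output projection described in the AnimateDiff paper. It folds the spatial dimensions into the batch so that attention runs only along the frame axis.

```python
import numpy as np

def temporal_self_attention(x, out_scale=1.0):
    """x: (B, F, C, H, W). Attends across frames independently at each spatial location."""
    b, f, c, h, w = x.shape
    # Fold space into the batch: each pixel location becomes a length-F sequence.
    seq = x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(c)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ seq
    out = out.reshape(b, h, w, f, c).transpose(0, 3, 4, 1, 2)
    return x + out_scale * out                       # residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8, 4, 3, 3))
y = temporal_self_attention(x)
y_init = temporal_self_attention(x, out_scale=0.0)
print(y.shape, np.allclose(y_init, x))  # (1, 8, 4, 3, 3) True
```

With `out_scale=0.0` the module is an exact identity, which is the intuition behind inserting it into a frozen image model: at the start of training the per-frame outputs are untouched, and temporal mixing is learned gradually from video data.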

AnimateDiff plus existing image ControlNets is the open community's answer to controlled video generation. A pose-driven workflow looks like this: extract a per-frame DWPose skeleton from a reference video (using e.g. MMPose), condition the underlying image model via ControlNet on those skeleton frames, and run AnimateDiff over the conditioned diffusion process. The result is a video in which the character follows the poses extracted from the reference but is rendered in the prompted style. This is the architecture behind many viral 2024 "anime in the style of X dancing the choreography of Y" memes; it is also the basis of more serious production pipelines for animated content.

# Motion-controlled video generation with AnimateDiff + a pose ControlNet
# pip install diffusers transformers accelerate controlnet_aux mmpose
# (DWposeDetector also needs the rest of the OpenMMLab stack: mmengine, mmcv, mmdet)
import torch
from diffusers import (
    AnimateDiffControlNetPipeline,
    ControlNetModel,
    MotionAdapter,
)
from diffusers.utils import export_to_video
from controlnet_aux import DWposeDetector
from PIL import Image

device = torch.device("cuda")

# 1) Extract a per-frame skeleton from a short reference video.
# DWposeDetector is constructed directly (it has no from_pretrained); with no
# config/checkpoint arguments it fetches defaults, per the controlnet_aux README.
pose_detector = DWposeDetector(device=device)
reference_frames = [            # PIL.Image list, e.g. 16 frames at 512x512
    Image.open(f"ref/frame_{i:03d}.png").convert("RGB") for i in range(16)
]
pose_frames = [pose_detector(img) for img in reference_frames]

# 2) Load an SD 1.5 checkpoint + AnimateDiff motion adapter + OpenPose-format ControlNet.
# The ControlNet-aware AnimateDiff pipeline in diffusers targets SD 1.5-family
# models, so this sketch uses SD 1.5 weights; the checkpoint IDs are one workable choice.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2",
    torch_dtype=torch.float16,
)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose",
    torch_dtype=torch.float16,
)
pipe = AnimateDiffControlNetPipeline.from_pretrained(
    "emilianJR/epiCRealism",
    motion_adapter=adapter,
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to(device)

# 3) Generate: same skeleton, different style
prompt = (
    "a knight in shining silver armor performing the same motion, "
    "dramatic studio lighting, photorealistic, cinematic"
)
result = pipe(
    prompt=prompt,
    conditioning_frames=pose_frames,    # one control image per output frame
    num_frames=len(pose_frames),
    num_inference_steps=25,
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.85,
)
export_to_video(result.frames[0], "knight_motion.mp4", fps=8)
print("Wrote knight_motion.mp4")
Output: Wrote knight_motion.mp4
Code Fragment 20.8.1: Motion-controlled video generation by extracting a pose skeleton from a reference clip with DWPose, conditioning a Stable Diffusion 1.5 checkpoint on it via an OpenPose-format ControlNet, and animating the result with AnimateDiff's motion adapter. The same motion choreography can be re-rendered in any style by changing the prompt while holding the pose conditioning fixed. controlnet_conditioning_scale trades off pose fidelity (high) against prompt adherence (low).

20.8.6 The ControlNet Channels for Video

Image ControlNet (Zhang et al., 2023) introduced a family of conditioning channels: depth maps, edge maps (Canny, HED, Sobel), segmentation masks, normal maps, and pose skeletons. Each channel adds a different inductive bias for what the model should preserve. Video ControlNet (and SparseCtrl, the explicit video-extension paper by Guo et al., 2024) ports the same idea: condition the video DiT on per-frame channel sequences and the model respects the structural information they encode.

The most useful channels for production video work are: depth (preserves the 3D structure of the scene; useful for stylized rerender of a real-world reference), pose (preserves human motion; useful for re-rendering dance, sports, or martial arts), Canny edges (preserves silhouettes and major lines; useful for stylized rerender that maintains composition), and segmentation (preserves region boundaries; useful for object-level animation control). Each channel can be combined with text prompts; the standard recipe is to extract the channel from a reference video, prompt the desired aesthetic, and let the model fill in everything else.
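The "extract the channel from a reference video" step produces one conditioning image per frame. The sketch below illustrates that shape contract with a deliberately crude edge detector: thresholded gradient magnitude in NumPy is used as a stand-in for Canny (an assumption made to keep the example dependency-free; in practice a real detector such as OpenCV's Canny would be used), and the moving-square video is synthetic.

```python
import numpy as np

def edge_channel(frames, threshold=0.2):
    """frames: (F, H, W) grayscale in [0, 1]. Returns (F, H, W) binary edge maps."""
    gy, gx = np.gradient(frames, axis=(1, 2))    # spatial gradients only, never across time
    magnitude = np.hypot(gx, gy)
    return (magnitude > threshold).astype(np.float32)

# Synthetic reference clip: a bright square sliding one pixel right per frame.
F, H, W = 6, 12, 16
frames = np.zeros((F, H, W), dtype=np.float32)
for f in range(F):
    frames[f, 4:8, 2 + f:6 + f] = 1.0

edges = edge_channel(frames)
print(edges.shape)  # (6, 12, 16)
```

The output keeps the square's outline in every frame and discards its interior, which is the "composition lock" the edge channel provides: the model is told where the silhouettes are and is free to invent texture, color, and lighting from the prompt.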

Control Modality    | Specificity            | Effort                      | Best For
--------------------|------------------------|-----------------------------|-------------------------------------
Prompt alone        | Low                    | Trivial                     | Exploratory generation
Camera path (6-DoF) | Medium (global)        | Medium                      | Cinematic moves
Motion brush        | Medium (regional)      | Low                         | Stylized 2D animation
Drag points         | High (point-level)     | Medium                      | Hero-shot precision
Pose skeleton       | Very high (body)       | Medium-high                 | Choreography, dance, sports
Depth map           | Very high (3D)         | High (per-frame extraction) | Stylized rerender of real video
Canny edges         | Very high (silhouette) | High                        | Style transfer with composition lock
Segmentation        | Very high (regions)    | High                        | Object-level animation

Figure 20.8.1a: Video control modalities. The Effort column reflects both the per-shot human work and the per-shot compute cost. Production pipelines usually compose two or three modalities; rarely is a single channel sufficient.

20.8.7 Character Consistency Across Shots

The control modalities discussed so far operate within a single shot. The harder problem is character consistency across multiple shots (cut from a wide shot of Alice walking into a building to a medium shot of Alice opening a door; Alice should be visibly the same person, same outfit, same hair, in both shots). This is essentially the video equivalent of IP-Adapter for images.

The 2025 solution that the frontier models converged on is the character token: train the model to attend to a special token derived from a few reference frames of the character, and reuse that token across all shots in a multi-shot sequence. Sora 2's cameos, Runway Gen-4's character sheets, and Pika's "Pikaframes" all implement this pattern with implementation details that differ. The open-source community uses IP-Adapter Face for the analogous capability in AnimateDiff workflows: extract a face embedding from reference images, condition the diffusion process on that embedding, and the rendered character resembles the reference across shots.
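The reuse pattern behind a character token can be sketched abstractly. Everything here is hypothetical: the random arrays stand in for the outputs of an image encoder and a text encoder, and mean-pooling plus normalization is just one simple way to collapse several reference frames into a single vector. The point is only that the same token is appended to the conditioning of every shot.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32

# Stand-in for encoder outputs on 5 reference frames of the character.
reference_embeddings = rng.normal(size=(5, D))

def character_token(embeddings):
    """Pool per-frame embeddings into one unit-norm identity vector."""
    token = embeddings.mean(axis=0)
    return token / np.linalg.norm(token)

def build_conditioning(prompt_tokens, char_token):
    """Append the character token to a shot's prompt tokens."""
    return np.concatenate([prompt_tokens, char_token[None, :]], axis=0)

alice = character_token(reference_embeddings)

# Two shots with different prompts (stand-in text tokens) share the same identity token.
wide_shot = build_conditioning(rng.normal(size=(7, D)), alice)
medium_shot = build_conditioning(rng.normal(size=(9, D)), alice)
print(wide_shot.shape, medium_shot.shape)  # (8, 32) (10, 32)
```

Because `alice` is computed once and reused, identity does not drift with the prompt; only the per-shot description changes between the wide shot and the medium shot.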

Multi-shot consistency is the gating capability for using video AI in narrative work. A trailer or short film with consistent characters is qualitatively different from a sequence of plausible but visually disconnected shots. Section 33.5 covers long-form and cinematic generation, which is where these multi-shot tools matter most.

Real-World Scenario: Music Video Production

A musician producing a 3-minute music video for an independent release used a controlled-generation pipeline rather than a live shoot. Step 1: storyboard 12 shots with rough sketches. Step 2: shoot the musician in front of a green screen for 10 minutes of varied poses, expressions, and dance moves; this provides character reference plates. Step 3: for each shot, use Runway Gen-4 with the musician's reference clip as the character anchor, a custom camera-path specification (orbit, dolly, etc.), and a prompt describing the location and lighting. Step 4: edit the 12 shots together in Premiere with the song. Step 5: ElevenLabs cleans up any audio leakage between shots. Total budget for the video was $1,200 in API costs, two days of one editor's time, and ten minutes of the artist's time. The same video shot traditionally would have been a $30,000-100,000 production.

20.8.8 Where Control Is Going

Two trends define 2026's trajectory in video control. First, controls are becoming compositional: instead of choosing one control modality, the user specifies multiple (camera path plus character reference plus motion brush plus depth) and the system blends them. Sora 2 and Gen-4 both expose this; the open community is catching up via composable AnimateDiff workflows. Second, controls are becoming declarative: instead of dragging points or painting brushes, the user describes the desired control in natural language ("the camera should slowly dolly in while the character walks toward it, and the lighting should shift from warm to cold over the 5-second duration"), and an LLM agent translates the description into the underlying control tokens. This is the LLM-as-director pattern Section 33.5 will revisit; it is the most likely interface for non-technical creators in 2026 and beyond.
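The declarative pattern implies an intermediate representation between the natural-language description and the control tokens. The sketch below assumes one: the `ShotSpec` schema, its field names, and the hard-coded JSON string (standing in for an LLM's structured output) are all hypothetical. It expands the example description (a slow dolly in with lighting shifting from warm to cold over 5 seconds) into per-frame control values.

```python
import json
from dataclasses import dataclass

@dataclass
class ShotSpec:
    """Hypothetical structured output an LLM director might emit for one shot."""
    duration_s: float
    fps: int
    camera_move: str
    camera_start_z: float
    camera_end_z: float
    light_kelvin_start: int
    light_kelvin_end: int

def expand(spec):
    """Linearly interpolate each declared control into one value per frame."""
    n = int(round(spec.duration_s * spec.fps))
    frames = []
    for i in range(n):
        t = i / max(n - 1, 1)
        frames.append({
            "camera_z": spec.camera_start_z + t * (spec.camera_end_z - spec.camera_start_z),
            "light_kelvin": spec.light_kelvin_start
            + t * (spec.light_kelvin_end - spec.light_kelvin_start),
        })
    return frames

# Stand-in for the LLM's response; no model is called in this sketch.
llm_output = (
    '{"duration_s": 5, "fps": 8, "camera_move": "dolly_in", '
    '"camera_start_z": 4.0, "camera_end_z": 2.0, '
    '"light_kelvin_start": 3200, "light_kelvin_end": 6500}'
)
spec = ShotSpec(**json.loads(llm_output))
controls = expand(spec)
print(len(controls), controls[0], controls[-1])
```

Parsing into a dataclass gives a cheap validation point: a response with a missing or misspelled field raises a `TypeError` before any expensive generation is launched.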

Fun Fact: The drag-point exploit

A favorite 2024 community discovery was that DragNUWA-style controls could be used to produce videos that violate intuitive physics: drag a ball's trajectory upward through the entire 4-second clip, and the model dutifully produces a ball that rises at constant velocity, defying gravity. The result is mostly used for absurdist memes (an apple floating off a table; a car rolling uphill; rain falling sideways), but it also demonstrates that the control signal can override the model's learned physics priors. This is a feature for stylized animation and a bug for documentary-style generation; the user can pick which.

Key Insight

Controlled video generation is conditioning with extra tokens: camera path, motion brush, drag points, pose skeleton, and ControlNet channels are all variants of the same architectural pattern. Production pipelines compose two or three modalities (commonly camera path plus character reference plus motion brush) to get intentional output rather than just plausible output. AnimateDiff plus existing image ControlNets is the open recipe for experimentation; Sora 2, Veo 3, and Runway Gen-4 ship the polished commercial versions. The interface frontier is compositional controls described in natural language and translated to underlying tokens by an LLM-as-director.

Self-Check
Q1: Architecturally, "camera path conditioning" and "pose skeleton conditioning" both inject extra token streams into the DiT. Why is camera path easier to train (i.e., produces more reliable models with less data) than pose conditioning, and what would you do at the data collection step to close the gap?
Answer: A 6-DoF camera trajectory is a tiny low-dimensional signal (six floats per frame) that corresponds to a global change in viewpoint affecting every pixel, so off-the-shelf structure-from-motion or SLAM tools (COLMAP, DROID-SLAM) can label millions of unconstrained internet videos and the model learns the conditioning with relatively few examples. Pose skeletons are high-dimensional (tens of 2D or 3D keypoints), they describe only a localized region of the image, and the conflict between "this character must follow this skeleton" and "the rest of the scene must look natural" is harder to resolve. To close the gap at data-collection time, you would pair every video with a multi-person DWPose extraction, hard-mine clips with strong character motion, and add synthetic pose-conditioned renders so the model sees clean ground-truth skeleton-video pairs in addition to noisy in-the-wild extractions.
Q2: Motion brush gives you regional direction arrows but does not let you specify per-pixel trajectories. Where exactly does this control modality fail (give a concrete scene), and what is the smallest extension that would address the failure?
Answer:
Motion brush fails whenever the desired motion is non-uniform inside a single brushed region. A clean example is a flag rippling in wind: the brush covers the whole flag and the user can only point the arrow in one direction, but real flag motion has waves traveling along the cloth and small reverse motions near the pole. The smallest extension that fixes this is to let the user paint multiple sub-regions inside the same object with different arrows, or equivalently to let the brush carry a coarse 2D vector field rather than a single arrow. That is exactly the design Pika 2.0's Pikaffects and several research follow-ups adopted.
Q3: The "drag-point exploit" lets users override physics priors. Decide whether this is a feature or a bug for: (a) a children's animation studio, (b) a courtroom video reconstruction service, (c) a music video for an indie artist. Justify each answer.
Answer:
For (a) a children's animation studio, it is a feature: cartoons exaggerate motion for expressive effect and drag-point override is the cheap path to bouncy, anti-physical character motion. For (b) a courtroom video reconstruction service, it is unambiguously a bug and arguably a safety hazard: a reconstruction that silently violates physics could mislead a jury, and the service should disable physics-overriding drag and prefer simulation-anchored generation with provenance. For (c) an indie music video, it is a feature: the form is artistic, distorted motion is part of the aesthetic, and the artist's intent legitimately overrides the model's physics prior.

In the next section, Section 20.9: Video Editing and Remixing, we continue.

What Comes Next

Section 20.9 pivots from generating new video to editing existing video: video inpainting, style transfer, frame interpolation, and the production pipelines that stitch these tools together. The same control machinery from this section is the substrate for editing because editing is mostly "generate new content that respects some existing content."

Further Reading
  • Zhang, L. et al. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet). ICCV 2023. arXiv:2302.05543. The foundational paper on dense per-pixel conditioning for diffusion models; the parent of every video control modality.
  • Guo, Y. et al. (2023). AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. ICLR 2024. arXiv:2307.04725. The motion-module recipe that lets any frozen image diffusion model become a video generator.
  • Wang, Z. et al. (2023). MotionCtrl: A Unified and Flexible Motion Controller for Video Generation. arXiv:2312.03641. The first credible camera-path control adapter for Stable Video Diffusion.
  • He, H. et al. (2024). CameraCtrl: Enabling Camera Control for Text-to-Video Generation. arXiv:2404.02101. The 2024 refinement of camera-path conditioning with cleaner training data extraction.
  • Yin, S. et al. (2023). DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory. arXiv:2308.08089. The drag-trajectory control pattern; foundational for Runway Gen-4's Motion Vector tool.
  • Guo, Y. et al. (2024). SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models. ECCV 2024. arXiv:2311.16933. ControlNet extensions for video with sparse-frame conditioning.
  • Yang, Z. et al. (2023). DWPose: Effective Whole-body Pose Estimation. arXiv:2307.15880. The de facto open pose detector used in most AnimateDiff control workflows.
  • Wu, W. et al. (2024). DragAnything: Motion Control for Anything using Entity Representation. ECCV 2024. arXiv:2403.07420. Drag-point control extended to multi-entity strict trajectory adherence.