"Anyone can prompt a model to make a video. The craft is making it move the way you imagined."
Pixel, Camera-Controlling AI Agent
A frontier video model from Section 33.2 produces a plausible shot from a prompt. In the hands of a craftsperson, the same model produces the shot they had in mind. The bridge between the two is the control layer: camera path conditioning (dolly, pan, orbit, crane moves), motion brush (mask out a region and indicate the direction it should move), drag-based control (click and drag to specify how an object travels through the frame), pose and skeleton conditioning (animate a character to match a reference motion), and ControlNet-style multi-channel guidance (depth, edge, segmentation maps that anchor the generation). This section walks through each control modality, shows how they plug into the underlying video DiT, and demonstrates AnimateDiff as the open recipe most accessible for experimentation.
Prerequisites
This section assumes the video DiT internals from Section 20.6 and the ControlNet image-conditioning pattern from Section 19.10. The camera-pose vocabulary is revisited in the 3D-generation chapter later in this part.
20.8.1 The Control Problem and the Spectrum of Solutions
A pure text-to-video model has one input (the text prompt) and emits an uncontrolled distribution of outputs. For most production purposes this is too loose: the cinematographer wants this specific camera move, the animator wants this specific motion, the editor wants this specific timing. Adding controls to a video DiT is an example of the broader pattern from image generation (ControlNet, IP-Adapter, T2I-Adapter): inject additional conditioning signals at training time so the model learns to respect them, then expose those signals at inference time as user-controllable inputs.
The spectrum of video controls spans, roughly in order of increasing specificity:
- Prompt-level: "a slow dolly-in shot of..." (cheap, low fidelity)
- Camera path: explicit 6-DoF trajectory or named-move (medium cost, high fidelity for global camera)
- Motion brush: per-region direction arrows (medium cost, high fidelity for foreground motion)
- Drag control: precise point trajectories (high cost, very high fidelity, fragile)
- Pose / skeleton: full-body or facial keypoints over time (high cost, very high fidelity for characters)
- ControlNet channels: dense per-frame depth, edges, or segmentation (highest cost, highest fidelity for structural copying)
Production systems compose these. Veo 3 ships camera-path control natively; Runway Gen-4 ships motion brush plus drag-based control plus reference-driven character consistency; AnimateDiff (open) gains pose conditioning and ControlNet channels by pairing its plug-in motion module with the ControlNets already trained for the underlying image model. Choosing the right control modality depends on what aspect of the shot you want to lock down.
At the architectural level, every video-control modality is the same thing: an additional set of input tokens (or an additional cross-attention stream) that the DiT attends to during denoising. Camera path becomes a sequence of 6-DoF tokens per frame. Motion brush becomes a 2D vector field per frame. Pose becomes a skeleton-token sequence. ControlNet channels become an extra image-channel stream per frame. The core architecture does not change; the conditioning vocabulary grows. This is the same trick that turned image ControlNet into the universal control framework for diffusion models in 2023, applied one dimension up.
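To make the "extra tokens" view concrete, here is a toy NumPy sketch. It assumes a single attention head with no learned query/key/value projections, and a random matrix stands in for the learned embedding layer. The helper names `embed` and `cross_attend` and the feature sizes (6 numbers per frame for a camera pose, 34 for a 17-keypoint 2D skeleton) are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FRAMES, TOKENS_PER_FRAME, DIM = 8, 4, 16


def embed(control: np.ndarray, dim: int) -> np.ndarray:
    """Project raw per-frame control features to the model width.

    A random matrix stands in for a learned linear layer.
    """
    in_dim = control.shape[-1]
    projection = rng.standard_normal((in_dim, dim)) / np.sqrt(in_dim)
    return control @ projection


def cross_attend(video_tokens: np.ndarray, control_tokens: np.ndarray) -> np.ndarray:
    """Let every video token read from the control tokens (softmax attention + residual)."""
    scores = video_tokens @ control_tokens.T / np.sqrt(video_tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return video_tokens + weights @ control_tokens


video_tokens = rng.standard_normal((NUM_FRAMES * TOKENS_PER_FRAME, DIM))
camera_path = rng.standard_normal((NUM_FRAMES, 6))    # one 6-DoF pose per frame
skeleton = rng.standard_normal((NUM_FRAMES, 17 * 2))  # 17 keypoints, (x, y) each

# The same attention function serves both modalities; only the tokens differ.
out_camera = cross_attend(video_tokens, embed(camera_path, DIM))
out_pose = cross_attend(video_tokens, embed(skeleton, DIM))
print(out_camera.shape, out_pose.shape)
```

The point of the sketch is that swapping camera tokens for skeleton tokens changes only the input to `embed`; the attention step is untouched.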
20.8.2 Camera Path Conditioning
Camera path conditioning is the most useful control modality for narrative cinematography. The user specifies a camera trajectory (a sequence of 6-DoF poses: position + orientation per frame, or a named move like "slow dolly in", "orbit clockwise", "crane up"), and the model generates a video whose implicit camera matches that path. The first credible implementation was MotionCtrl (Wang et al., 2023, arXiv:2312.03641), a research adapter that retrofitted camera-path conditioning onto Stable Video Diffusion. CameraCtrl (He et al., 2024, arXiv:2404.02101) followed for SVD with cleaner training and stronger generalization. By 2025 the major commercial systems shipped native camera control: Luma Dream Machine, Pika, Kling, and Veo 3 all expose 6-DoF camera input either as a parameter to the API or as an interactive 3D widget in the UI.
The training trick that makes camera control work is to extract camera trajectories from video training data using off-the-shelf SLAM or structure-from-motion (COLMAP, DROID-SLAM), then condition the DiT on the extracted camera path as an additional input. At inference, the user specifies the desired path and the model generates a video that respects it. Quality is good for short clips with simple moves; complex multi-segment paths (a dolly-in followed by a pan followed by a crane move) still have artifacts at the seams between segments.
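As a minimal sketch of what a camera path looks like as data, the following turns two named moves into per-frame 6-DoF poses and flattens them into one 6-vector per frame. The `CameraPose` class, the helper names, and the coordinate convention (subject at the origin, camera looking down -z at yaw 0, positive yaw rotating about +y) are assumptions made for illustration; real systems each define their own conventions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraPose:
    """One 6-DoF pose: position in scene units, orientation in degrees."""
    x: float
    y: float
    z: float
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0


def _fraction(i: int, num_frames: int) -> float:
    """Progress through the move, from 0.0 on the first frame to 1.0 on the last."""
    return i / (num_frames - 1) if num_frames > 1 else 0.0


def dolly_in(num_frames: int, start_z: float = 4.0, end_z: float = 2.0) -> list:
    """Slide the camera along z toward a subject at the origin."""
    return [
        CameraPose(0.0, 0.0, start_z + _fraction(i, num_frames) * (end_z - start_z))
        for i in range(num_frames)
    ]


def orbit(num_frames: int, radius: float = 3.0, sweep_deg: float = 90.0) -> list:
    """Circle the origin in the x-z plane, yawing so the subject stays centred."""
    poses = []
    for i in range(num_frames):
        angle_deg = _fraction(i, num_frames) * sweep_deg
        angle = math.radians(angle_deg)
        poses.append(
            CameraPose(radius * math.sin(angle), 0.0, radius * math.cos(angle), yaw=angle_deg)
        )
    return poses


def to_tokens(poses: list) -> list:
    """Flatten each pose into a 6-vector, one per frame."""
    return [[p.x, p.y, p.z, p.yaw, p.pitch, p.roll] for p in poses]


# A two-segment path: the orbit starts where the dolly ends, so the seam is continuous.
path = dolly_in(8, start_z=4.0, end_z=2.0) + orbit(8, radius=2.0, sweep_deg=90.0)
tokens = to_tokens(path)
print(len(tokens), tokens[0], tokens[-1])
```

Matching the end pose of one segment to the start pose of the next, as the last lines do, removes the positional jump at the seam, though it does nothing about the change in velocity there.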
20.8.3 Motion Brush and Region Direction Arrows
Motion brush is the most intuitive control modality for non-technical users. The user paints a region of the first frame (the "brush") and draws an arrow indicating how that region should move over the duration of the clip. The most common implementations support multiple regions (e.g., "this person walks left, this car drives right, the background stays still") and intensity control (gentle motion vs. fast motion).
The architectural implementation conditions the DiT on a per-frame 2D vector field plus per-region masks. The vector field encodes the desired motion direction and magnitude at each pixel; the masks indicate which regions are under user control versus which should follow the prompt's natural motion. Kling introduced motion brush as a core feature in 2024; Pika 2.0 followed with a UI variation called Pikaffects. Both are particularly well suited to stylized animation work (anime, 2D motion graphics), where the desired motion must be specified precisely.
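The vector-field-plus-mask representation can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the function name `motion_brush_field` is hypothetical, each arrow is interpreted as the total displacement in pixels over the clip and split evenly across frames, and where strokes overlap the later one wins.

```python
import numpy as np


def motion_brush_field(height, width, num_frames, strokes):
    """Rasterize brush strokes into a per-frame vector field and a control mask.

    strokes: list of (mask, (dx, dy)); mask is a bool array of shape (H, W),
    and (dx, dy) is the total displacement in pixels over the whole clip.
    Returns (field, controlled): field has shape (T, H, W, 2) and holds the
    per-frame displacement; controlled is the union of all painted regions.
    """
    field = np.zeros((num_frames, height, width, 2), dtype=np.float32)
    controlled = np.zeros((height, width), dtype=bool)
    steps = max(num_frames - 1, 1)
    for mask, (dx, dy) in strokes:
        # Same per-frame step for every pixel in the region, on every frame.
        field[:, mask] = np.array([dx, dy], dtype=np.float32) / steps
        controlled |= mask
    return field, controlled


H, W, T = 8, 8, 5
person = np.zeros((H, W), dtype=bool)
person[:, :4] = True   # left half of the frame
car = np.zeros((H, W), dtype=bool)
car[6:, 4:] = True     # bottom-right block

# "This person walks left, this car drives right, the background stays still."
field, controlled = motion_brush_field(
    H, W, T, [(person, (-8.0, 0.0)), (car, (4.0, 0.0))]
)
print(field.shape, int(controlled.sum()))
```

Pixels outside every mask keep a zero vector and `controlled=False`, which is how the model can tell "hold still" apart from "follow the prompt" once the mask is supplied alongside the field.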
20.8.4 Drag-Based Control: DragNUWA and Its Descendants
Drag-based control gives the user point-level precision: click on a pixel in the first frame, drag to where that pixel should end up at the end of the clip, and the model produces a video where that point travels along the specified trajectory. DragNUWA (Yin et al., 2023, arXiv:2308.08089) was the first widely known implementation; DragAnything (Wu et al., 2024) extended the approach to handle multiple drag points with strict trajectory adherence. Runway Gen-4's "Motion Vector" tool is the polished commercial version of this idea.
The control is precise but fragile. For simple object translations (a ball moving across a table, a character's hand reaching to grab a cup) it works well. For complex transformations (a face shape-shifting, a character's full body following a precise martial arts move) the model often produces artifacts because the drag points underspecify the desired motion. In production, drag control is most useful for hero shots with one or two key points that have to land precisely, with the rest of the motion handled by prompt or motion brush.
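A drag gesture is just a start point and an end point, so before it can condition anything it has to be expanded into one position per frame. The sketch below uses straight-line, constant-speed interpolation, which is an assumption made for simplicity; the function names are hypothetical, and real tools may let the user draw curved or eased paths.

```python
def drag_to_trajectory(start, end, num_frames):
    """Expand one drag gesture into per-frame (x, y) positions.

    Uses straight-line interpolation at constant speed.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to describe motion")
    (x0, y0), (x1, y1) = start, end
    return [
        (
            x0 + (x1 - x0) * i / (num_frames - 1),
            y0 + (y1 - y0) * i / (num_frames - 1),
        )
        for i in range(num_frames)
    ]


def frame_displacements(trajectory):
    """Displacement of the tracked point between consecutive frames."""
    return [
        (bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(trajectory, trajectory[1:])
    ]


# A ball dragged 30 pixels to the right over a 4-frame clip.
ball = drag_to_trajectory((10.0, 50.0), (40.0, 50.0), num_frames=4)
steps = frame_displacements(ball)
print(ball)
print(steps)
```

Only the dragged points are constrained; every other pixel is left to the model, which is why one or two points underspecify a complex motion.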
20.8.5 AnimateDiff and the Open Control Recipe
AnimateDiff (Guo et al., 2023, arXiv:2307.04725) is the canonical open recipe for adding motion to image diffusion models, and remains the most accessible playground for experimenting with video control in 2026. The core idea is to take a frozen image diffusion model (Stable Diffusion 1.5 in the original release, with an SDXL motion adapter available in beta) and add a small motion module (a set of temporal-attention layers) that turns the still-image model into a video generator. The motion module is trained on video data while the image backbone is kept frozen, which lets AnimateDiff inherit the ControlNet, IP-Adapter, and LoRA ecosystem of the underlying image model.
AnimateDiff plus existing image ControlNets is the open community's answer to controlled video generation. A pose-driven workflow looks like this: extract a per-frame DWPose skeleton from a reference video (using e.g. MMPose), condition the underlying image model via ControlNet on those skeleton frames, and run AnimateDiff over the conditioned diffusion process. The result is a video in which the character follows the poses extracted from the reference but is rendered in the prompted style. This is the architecture behind many viral 2024 "anime in the style of X dancing the choreography of Y" memes; it is also the basis of more serious production pipelines for animated content.
# Motion-controlled video generation with AnimateDiff + a pose ControlNet.
# pip install diffusers transformers accelerate controlnet_aux
# DWposeDetector also needs the OpenMMLab stack (mmcv, mmdet, mmpose);
# see the controlnet_aux README for install steps.
import torch
from diffusers import (
    AnimateDiffControlNetPipeline,
    ControlNetModel,
    MotionAdapter,
)
from diffusers.utils import export_to_video
from controlnet_aux import DWposeDetector
from PIL import Image

# 1) Extract a per-frame skeleton from a short reference video.
# Called with no arguments, the detector is expected to fetch default
# checkpoints and run on CPU; constructor arguments vary by controlnet_aux version.
pose_detector = DWposeDetector()
reference_frames = [  # list of PIL images, e.g. 16 frames at 512x512
    Image.open(f"ref/frame_{i:03d}.png").convert("RGB") for i in range(16)
]
pose_frames = [pose_detector(img) for img in reference_frames]

# 2) Load an image backbone + AnimateDiff motion adapter + OpenPose ControlNet.
# diffusers' ControlNet-enabled AnimateDiff pipeline targets SD 1.5-family
# checkpoints, so this example uses SD 1.5 models rather than SDXL.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2",
    torch_dtype=torch.float16,
)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose",
    torch_dtype=torch.float16,
)
pipe = AnimateDiffControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    motion_adapter=adapter,
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# 3) Generate: same skeleton, different style.
prompt = (
    "a knight in shining silver armor performing the same motion, "
    "dramatic studio lighting, photorealistic, cinematic"
)
result = pipe(
    prompt=prompt,
    conditioning_frames=pose_frames,  # one skeleton image per output frame
    num_frames=len(pose_frames),
    num_inference_steps=25,
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.85,
)
export_to_video(result.frames[0], "knight_motion.mp4", fps=8)
print("Wrote knight_motion.mp4")
controlnet_conditioning_scale trades off pose fidelity (high) against prompt adherence (low).
20.8.6 The ControlNet Channels for Video
Image ControlNet (Zhang et al., 2023) introduced a family of conditioning channels: depth maps, edge maps (Canny, HED, Sobel), segmentation masks, normal maps, and pose skeletons. Each channel adds a different inductive bias for what the model should preserve. Video ControlNet (and SparseCtrl, the explicit video-extension paper by Guo et al., 2024) ports the same idea: condition the video DiT on per-frame channel sequences and the model respects the structural information they encode.
The most useful channels for production video work are: depth (preserves the 3D structure of the scene; useful for stylized rerender of a real-world reference), pose (preserves human motion; useful for re-rendering dance, sports, or martial arts), Canny edges (preserves silhouettes and major lines; useful for stylized rerender that maintains composition), and segmentation (preserves region boundaries; useful for object-level animation control). Each channel can be combined with text prompts; the standard recipe is to extract the channel from a reference video, prompt the desired aesthetic, and let the model fill in everything else.
| Control Modality | Specificity | Effort | Best For |
|---|---|---|---|
| Prompt alone | Low | Trivial | Exploratory generation |
| Camera path (6-DoF) | Medium (global) | Medium | Cinematic moves |
| Motion brush | Medium (regional) | Low | Stylized 2D animation |
| Drag points | High (point-level) | Medium | Hero-shot precision |
| Pose skeleton | Very high (body) | Medium-high | Choreography, dance, sports |
| Depth map | Very high (3D) | High (per-frame extraction) | Stylized rerender of real video |
| Canny edges | Very high (silhouette) | High | Style transfer with composition lock |
| Segmentation | Very high (regions) | High | Object-level animation |
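The "extract the channel from a reference video" step of the ControlNet recipe can be sketched without any model at all. The example below computes a per-frame edge channel with a plain Sobel filter in NumPy. It is a stand-in chosen to keep the example dependency-free: production pipelines more often use Canny or a learned detector, and the function names and the threshold value here are illustrative assumptions.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T


def sobel_edges(frame: np.ndarray) -> np.ndarray:
    """Gradient magnitude of one grayscale frame of shape (H, W)."""
    height, width = frame.shape
    padded = np.pad(frame.astype(np.float32), 1, mode="edge")
    gx = np.zeros((height, width), dtype=np.float32)
    gy = np.zeros((height, width), dtype=np.float32)
    # Accumulate the 3x3 filter response one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + height, j:j + width]
            gx += SOBEL_X[i, j] * window
            gy += SOBEL_Y[i, j] * window
    return np.hypot(gx, gy)


def edge_channel(video: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Turn a (T, H, W) grayscale video into a (T, H, W) uint8 edge-map sequence."""
    edges = np.stack([sobel_edges(frame) > threshold for frame in video])
    return edges.astype(np.uint8) * 255


# Toy reference video: 3 frames, dark left half, bright right half.
video = np.zeros((3, 8, 8), dtype=np.float32)
video[:, :, 4:] = 1.0
edges = edge_channel(video)
print(edges.shape, edges[0, 0].tolist())
```

The resulting array has one edge map per frame, which is the shape a per-frame conditioning stream expects; depth, pose, or segmentation channels slot into the same position with a different extractor.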
20.8.7 Character Consistency Across Shots
The control modalities discussed so far operate within a single shot. The harder problem is character consistency across multiple shots (cut from a wide shot of Alice walking into a building to a medium shot of Alice opening a door; Alice should be visibly the same person, same outfit, same hair, in both shots). This is essentially the video equivalent of IP-Adapter for images.
The 2025 solution that the frontier models converged on is the character token: train the model to attend to a special token derived from a few reference frames of the character, and reuse that token across all shots in a multi-shot sequence. Sora 2's cameos, Runway Gen-4's character sheets, and Pika's "Pikaframes" all implement this pattern with implementation details that differ. The open-source community uses IP-Adapter Face for the analogous capability in AnimateDiff workflows: extract a face embedding from reference images, condition the diffusion process on that embedding, and the rendered character resembles the reference across shots.
Multi-shot consistency is the gating capability for using video AI in narrative work. A trailer or short film with consistent characters is qualitatively different from a sequence of plausible but visually disconnected shots. Section 33.5 covers long-form and cinematic generation, which is where these multi-shot tools matter most.
A musician producing a 3-minute music video for an independent release used a controlled-generation pipeline rather than a live shoot. Step 1: storyboard 12 shots with rough sketches. Step 2: shoot the musician in front of a green screen for 10 minutes of varied poses, expressions, and dance moves; this provides character reference plates. Step 3: for each shot, use Runway Gen-4 with the musician's reference clip as the character anchor, a custom camera-path specification (orbit, dolly, etc.), and a prompt describing the location and lighting. Step 4: edit the 12 shots together in Premiere with the song. Step 5: ElevenLabs cleans up any audio leakage between shots. Total budget for the video was $1,200 in API costs, two days of one editor's time, and ten minutes of the artist's time. The same video shot traditionally would have been a $30,000-100,000 production.
20.8.8 Where Control Is Going
Two trends define 2026's trajectory in video control. First, controls are becoming compositional: instead of choosing one control modality, the user specifies multiple (camera path plus character reference plus motion brush plus depth) and the system blends them. Sora 2 and Gen-4 both expose this; the open community is catching up via composable AnimateDiff workflows. Second, controls are becoming declarative: instead of dragging points or painting brushes, the user describes the desired control in natural language ("the camera should slowly dolly in while the character walks toward it, and the lighting should shift from warm to cold over the 5-second duration"), and an LLM agent translates the description into the underlying control tokens. This is the LLM-as-director pattern Section 33.5 will revisit; it is the most likely interface for non-technical creators in 2026 and beyond.
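One way to picture the declarative interface is as a structured shot specification that an LLM emits and a pipeline validates before dispatching to the individual controls. The schema below is entirely hypothetical: the `ShotSpec` fields, the list of allowed camera moves, and the JSON layout are assumptions for illustration and do not correspond to any vendor's API.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

ALLOWED_MOVES = {"static", "dolly_in", "dolly_out", "pan_left", "pan_right", "orbit", "crane_up"}


@dataclass
class ShotSpec:
    """Hypothetical compositional control spec for a single shot."""
    prompt: str
    duration_s: float
    camera_move: str = "static"
    character_ref: Optional[str] = None  # id of a character reference, if any
    motion_strokes: list = field(default_factory=list)  # e.g. region + direction


def parse_shot_spec(raw_json: str) -> ShotSpec:
    """Parse and validate JSON (e.g. produced by an LLM) into a ShotSpec."""
    spec = ShotSpec(**json.loads(raw_json))  # unknown keys raise TypeError
    if spec.camera_move not in ALLOWED_MOVES:
        raise ValueError(f"unknown camera move: {spec.camera_move!r}")
    if spec.duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return spec


# What an LLM might emit for: "the camera slowly dollies in while the
# character walks toward it, over a 5-second duration".
raw = json.dumps({
    "prompt": "a woman walks toward the camera, lighting shifts from warm to cold",
    "duration_s": 5,
    "camera_move": "dolly_in",
    "character_ref": "alice_v1",
    "motion_strokes": [{"region": "character", "direction": "toward_camera"}],
})
spec = parse_shot_spec(raw)
print(spec.camera_move, spec.character_ref, len(spec.motion_strokes))
```

Validating against a closed vocabulary keeps a mistaken LLM output (an unsupported move, a negative duration) from silently reaching the generator.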
A favorite 2024 community discovery was that DragNUWA-style controls could be used to produce videos that violate intuitive physics: drag a ball's trajectory upward through the entire 4-second clip, and the model dutifully produces a ball that rises at constant velocity, defying gravity. The result is mostly used for absurdist memes (an apple floating off a table; a car rolling uphill; rain falling sideways), but it also demonstrates that the control signal can override the model's learned physics priors. This is a feature for stylized animation and a bug for documentary-style generation; the user can pick which.
Controlled video generation is conditioning with extra tokens: camera path, motion brush, drag points, pose skeleton, and ControlNet channels are all variants of the same architectural pattern. Production pipelines compose two or three modalities (commonly camera path plus character reference plus motion brush) to get intentional output rather than just plausible output. AnimateDiff plus existing image ControlNets is the open recipe for experimentation; Sora 2, Veo 3, and Runway Gen-4 ship the polished commercial versions. The interface frontier is compositional controls described in natural language and translated to underlying tokens by an LLM-as-director.
In the next section, Section 20.9: Video Editing and Remixing, we continue.
What Comes Next
Section 20.9 pivots from generating new video to editing existing video: video inpainting, style transfer, frame interpolation, and the production pipelines that stitch these tools together. The same control machinery from this section is the substrate for editing because editing is mostly "generate new content that respects some existing content."
Further Reading
- Zhang, L. et al. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet). ICCV 2023. arXiv:2302.05543. The foundational paper on dense per-pixel conditioning for diffusion models; the parent of every video control modality.
- Guo, Y. et al. (2023). AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. ICLR 2024. arXiv:2307.04725. The motion-module recipe that lets any frozen image diffusion model become a video generator.
- Wang, Z. et al. (2023). MotionCtrl: A Unified and Flexible Motion Controller for Video Generation. arXiv:2312.03641. The first credible camera-path control adapter for Stable Video Diffusion.
- He, H. et al. (2024). CameraCtrl: Enabling Camera Control for Text-to-Video Generation. arXiv:2404.02101. The 2024 refinement of camera-path conditioning with cleaner training data extraction.
- Yin, S. et al. (2023). DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory. arXiv:2308.08089. The drag-trajectory control pattern; foundational for Runway Gen-4's Motion Vector tool.
- Guo, Y. et al. (2024). SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models. ECCV 2024. arXiv:2311.16933. ControlNet extensions for video with sparse-frame conditioning.
- Yang, Z. et al. (2023). DWPose: Effective Whole-body Pose Estimation. arXiv:2307.15880. The de facto open pose detector used in most AnimateDiff control workflows.
- Wu, W. et al. (2024). DragAnything: Motion Control for Anything using Entity Representation. ECCV 2024. arXiv:2403.07420. Drag-point control extended to multi-entity strict trajectory adherence.