Section 20.9: Video Editing and Remixing

"Generation is the headline. Editing is the long tail. The model that fixes a single bad frame in a million-dollar shot is worth more in production than the model that generated the shot."
Pixel, Cut-Loving AI Agent

Big Picture

Most working video AI in 2026 is editing, not generation. The everyday tasks are removing an unwanted object from a finished shot (inpainting), replacing the style or season of a video (style transfer), doubling the frame rate of a vintage recording (frame interpolation with RIFE or FILM), upscaling a 480p archive to 4K (Real-ESRGAN, Topaz Video AI), and stitching all of this into a production pipeline, the video counterpart of the DAW-style audio editing in Section 20.4. This section walks through the four post-production AI tasks that touch the most real customer workflows, shows their architectures (largely diffusion and flow-matching, with task-specific losses), and discusses the production patterns that compose them into film, broadcast, and social-media pipelines.

Prerequisites

This section assumes the video generation models from Section 20.7, the inpainting and editing patterns from Section 19.10, and basic familiarity with optical flow and frame interpolation networks.

20.9.1 Video Inpainting: Removing, Replacing, and Extending

Video inpainting fills in masked regions of a video sequence over time, propagating fill content coherently across frames. The 2024 state of the art uses one of three architectural patterns. The first is temporally coherent 2D inpainting: run a still-image inpainting model (Stable Diffusion XL Inpaint, LaMa) frame by frame, with optical-flow-warped masks for temporal consistency. The second is video-native diffusion inpainting: a video DiT trained with a masking objective that takes a partially masked video and fills in the masked regions in one forward pass. The third is NeRF or 3D-Gaussian-splatting reconstruction: rebuild a 3D scene from the unmasked pixels and re-render to fill the masked regions; this is overkill for most cases but handles complex camera moves where other methods fail.

The use cases that drive video inpainting in production are removing wires from action shots (rigging marks, safety cables), erasing watermarks from licensed footage that was delivered at the wrong tier with the watermark still burned in, removing distracting bystanders or unwanted brand logos from a scene, replacing a stunt double's face with the actor's face (the formal "deepfake face replacement" technique that VFX houses have been doing manually for a decade and now mostly automate), and extending the frame edges (outpainting) to convert a 16:9 master to a 2.39:1 cinematic frame by adding horizontal footage rather than cropping height. Runway's "Inpaint" tool, Adobe Premiere Pro's "Object Removal," and the open-source ProPainter (Zhou et al., 2023) are the most-used implementations.

Key Insight: Inpainting is generation with an unusual loss

From the model's perspective, inpainting is "generate frames that match the rest of the video except in this masked region." The training objective is reconstruction loss on the masked region given the unmasked region as context. The model architecture is the same DiT from Section 33.1; only the conditioning (a mask plus the visible pixels) differs. Once you see this equivalence, the line between "video generation" and "video editing" blurs to nothing: a model that can do unconditional text-to-video generation already knows almost everything it needs to do inpainting, and the engineering work is the conditioning interface plus the training-set rebalancing toward masking-heavy data.
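The "reconstruction loss on the masked region" can be made concrete with a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `masked_reconstruction_loss` is hypothetical, and the loss is written directly on pixels for clarity, whereas a diffusion model would usually regress noise or velocity under the same mask.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over masked pixels only.

    pred, target: float arrays of shape (T, H, W, C).
    mask: array of shape (T, H, W); 1 marks a pixel to fill, 0 is context.
    """
    m = mask.astype(pred.dtype)[..., None]        # broadcast over channels
    sq_err = (pred - target) ** 2 * m             # context pixels contribute 0
    n = float(m.sum()) * pred.shape[-1]           # masked pixels x channels
    return float(sq_err.sum() / max(n, 1.0))

rng = np.random.default_rng(0)
target = rng.random((4, 8, 8, 3))                 # toy 4-frame clip
mask = np.zeros((4, 8, 8))
mask[:, 2:6, 2:6] = 1                             # region to be filled

context_only = target * (1 - mask[..., None])     # what the model is shown
loss_perfect = masked_reconstruction_loss(target.copy(), target, mask)
loss_blank = masked_reconstruction_loss(context_only, target, mask)

# Errors outside the mask are ignored by construction.
noisy_context = target.copy()
noisy_context[:, 0, 0, :] += 5.0
loss_context_err = masked_reconstruction_loss(noisy_context, target, mask)
print(loss_perfect, loss_blank, loss_context_err)
```

Only the conditioning (visible pixels plus mask) and the region over which the loss is averaged differ from an ordinary generation objective, which is the equivalence the paragraph above describes.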

20.9.2 Style Transfer for Video

Video style transfer takes a source video (the content) and reshapes its visual style to match a target reference (an image, a video, or a text prompt describing the desired style). The 2024 standard is to use a video DiT with depth or edge conditioning (Section 33.3) to preserve the source's structural content, while a text or image prompt steers the aesthetic. This is the video equivalent of img2img with strong structural conditioning.

The hardest case is video-to-video, where both content and style come from videos. Seminal works here are TokenFlow (Geyer et al., 2024) and StableVideo (Chai et al., 2023), which compute token-level correspondences between a source video and an edited diffusion latent to enforce temporal consistency. Runway's "Gen-3 Style" mode and Pika's "Video-to-Video Style Transfer" are commercial productionizations of similar ideas. Quality is strong for stylized aesthetics (cartoon, watercolor, oil painting) and weaker for photorealistic style transfer (for example, a winter shot rendered as a summer scene); the latter requires careful prompt engineering and often does not converge to acceptable quality.
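The token-level correspondence idea can be illustrated with a toy nearest-neighbour propagation. This is a simplified sketch, not the TokenFlow implementation: the function name `propagate_edit` is hypothetical, features are random vectors rather than diffusion activations, and a single keyframe is used where a real system would combine several.

```python
import numpy as np

def propagate_edit(src_key, src_frame, edited_key):
    """Copy edited keyframe tokens to another frame via nearest neighbours.

    src_key:    (N, D) source features of the keyframe tokens
    src_frame:  (M, D) source features of the frame to be edited
    edited_key: (N, D) edited features of the keyframe tokens
    Returns the (M, D) propagated features and the (M,) match indices.
    """
    a = src_frame / np.linalg.norm(src_frame, axis=1, keepdims=True)
    b = src_key / np.linalg.norm(src_key, axis=1, keepdims=True)
    nn = (a @ b.T).argmax(axis=1)       # cosine-similarity nearest neighbour
    return edited_key[nn], nn

rng = np.random.default_rng(0)
src_key = rng.normal(size=(5, 16))      # 5 keyframe tokens, 16-dim features
perm = np.array([3, 0, 4, 1, 2])        # the same content, moved around
src_frame = src_key[perm] + 0.01 * rng.normal(size=(5, 16))
edited_key = rng.normal(size=(5, 16))   # stand-in for the stylized keyframe

edited_frame, nn = propagate_edit(src_key, src_frame, edited_key)
print(nn)
```

Because the matches are computed on the source features, the edit follows the content as it moves, which is what keeps the stylized result from flickering between frames.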

20.9.3 Frame Interpolation: RIFE and FILM

Frame interpolation takes a video at frame rate N and produces a video at frame rate 2N (or 4N, or 8N) by synthesizing intermediate frames. The two open-source leaders are RIFE (Real-Time Intermediate Flow Estimation, Huang et al., 2022, arXiv:2011.06294) and FILM (Frame Interpolation for Large Motion, Reda et al., 2022, Google). Both use a neural network to estimate optical flow between the two reference frames and warp the pixels accordingly; FILM is more robust to large motion (sports, dance) while RIFE is faster and more efficient.

Saying these models "estimate optical flow and warp pixels" hides the mechanism that does the work. Optical flow is a dense displacement field $F_{t\to t+1}$ that, for each pixel coordinate $x = (u, v)$ in frame $I_t$, gives the offset $F_{t\to t+1}(x)$ to the location of the same scene point in frame $I_{t+1}$. Given the flow, backward warping synthesizes an aligned frame by sampling the source frame at the displaced coordinates:

$$ \hat{I}(x) = I\big(x + F(x)\big). $$

The displaced coordinate $x + F(x)$ almost never lands on an integer pixel, so $I(\cdot)$ is evaluated by bilinear sampling: the four pixels surrounding the sub-pixel location are blended by their fractional distances. This is "backward" warping because we iterate over the output grid and pull a value for each target pixel, which guarantees every output pixel is filled (a forward warp that pushes source pixels forward leaves holes wherever the flow diverges). Bilinear sampling is also differentiable in both $I$ and $F$, which is what lets the flow network train end to end through the warp.
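Backward warping with bilinear sampling fits in a short NumPy function. The sketch below is illustrative only: the name `backward_warp` and the border-clamping policy are assumptions, and production code would use a GPU sampling primitive instead.

```python
import numpy as np

def backward_warp(img, flow):
    """Sample img at x + flow(x) with bilinear interpolation.

    img:  (H, W) or (H, W, C) float array
    flow: (H, W, 2) offsets (du, dv); u is horizontal, v is vertical
    Sample positions outside the image are clamped to the border.
    """
    H, W = img.shape[:2]
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    su = np.clip(u + flow[..., 0], 0, W - 1)      # sub-pixel sample columns
    sv = np.clip(v + flow[..., 1], 0, H - 1)      # sub-pixel sample rows
    u0 = np.floor(su).astype(int)
    v0 = np.floor(sv).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    fu, fv = su - u0, sv - v0                     # fractional distances
    if img.ndim == 3:                             # broadcast over channels
        fu, fv = fu[..., None], fv[..., None]
    top = img[v0, u0] * (1 - fu) + img[v0, u1] * fu
    bot = img[v1, u0] * (1 - fu) + img[v1, u1] * fu
    return top * (1 - fv) + bot * fv

ramp = np.tile(np.arange(6, dtype=float), (4, 1))  # pixel value = column index
half_px = np.zeros((4, 6, 2))
half_px[..., 0] = 0.5                              # pull from half a pixel right
warped = backward_warp(ramp, half_px)
identity = backward_warp(ramp, np.zeros((4, 6, 2)))
print(warped[0])
```

On the ramp image a half-pixel flow yields values halfway between neighbouring columns, and every output pixel receives a value because the loop runs over the output grid.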

To synthesize a frame at the midpoint $t + 0.5$, RIFE and FILM do not warp an endpoint all the way across; they estimate intermediate flows $F_{t\to t+0.5}$ and $F_{t+1\to t+0.5}$ that carry each bracketing frame to the midpoint, warp both inward, and fuse them. The fusion is a learned soft blend mask $W \in [0,1]$ that decides, per pixel, which warped frame to trust:

$$ \hat{I}_{t+0.5} = W \odot \hat{I}_{t\to t+0.5} + (1 - W) \odot \hat{I}_{t+1\to t+0.5}. $$
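A toy example shows why the blend has to be per pixel rather than a plain average. This is a sketch with hand-set values: in a real interpolator the mask is predicted by the network, and the name `fuse_midpoint` is hypothetical.

```python
import numpy as np

def fuse_midpoint(warp_prev, warp_next, blend):
    """Per-pixel soft blend of two frames already warped to the midpoint.

    warp_prev, warp_next: (H, W) arrays warped from frames t and t+1
    blend: (H, W) mask in [0, 1]; 1 trusts warp_prev, 0 trusts warp_next
    """
    blend = np.clip(blend, 0.0, 1.0)
    return blend * warp_prev + (1.0 - blend) * warp_next

# The left two columns are occluded in frame t+1, so the warp from t+1
# holds garbage there (-1) while the warp from t holds the true content.
truth = np.tile(np.arange(4, dtype=float), (4, 1))
warp_prev = truth.copy()
warp_next = truth.copy()
warp_next[:, :2] = -1.0

blend = np.full((4, 4), 0.5)    # both warps agree here: average them
blend[:, :2] = 1.0              # occluded region: take frame t only

mid = fuse_midpoint(warp_prev, warp_next, blend)
naive = fuse_midpoint(warp_prev, warp_next, np.full((4, 4), 0.5))
err_mid = np.abs(mid - truth).max()
err_naive = np.abs(naive - truth).max()
print(err_mid, err_naive)
```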

The mask matters most at occlusion boundaries, where a scene point is visible in only one of the two frames; there the network learns to set $W$ near $1$ or $0$ so the midpoint is taken from the frame that actually saw the content. Training as described here combines three loss terms. A reconstruction loss, $\lVert \hat{I}_{t+0.5} - I_{t+0.5} \rVert$ against a held-out true middle frame, supervises the final pixels. A forward-backward flow-consistency term regularizes the flow itself: in regions with no occlusion, going forward then backward should return to the start, so the loss penalizes $\lVert F_{t\to t+1}(x) + F_{t+1\to t}\big(x + F_{t\to t+1}(x)\big) \rVert$ over matched (non-occluded) pixels. A photometric warping loss, $\lVert I_t(x) - I_{t+1}\big(x + F_{t\to t+1}(x)\big) \rVert$, supplies gradient even without a ground-truth middle frame by demanding that the flow make one frame reconstruct the other. The failure modes follow directly from these assumptions: at occlusions the consistency term has no valid match and the blend mask must paper over disoccluded content; under large motion the true displacement exceeds the network's receptive field, so the flow estimate collapses and the warp tears or ghosts. FILM's large-motion robustness comes from a scale-pyramid flow estimator that searches coarse-to-fine for exactly these big displacements.
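The forward-backward check can be sketched directly: follow the forward flow from each pixel, read the backward flow at the landing point, and measure how far the round trip ends from the start. The code below is a minimal illustration with assumptions flagged: the name `fb_consistency_residual` is hypothetical, the lookup uses nearest-neighbour rounding rather than bilinear sampling, and the occlusion threshold of 1 pixel is arbitrary.

```python
import numpy as np

def fb_consistency_residual(flow_fw, flow_bw):
    """Per-pixel norm of F_fw(x) + F_bw(x + F_fw(x)).

    flow_fw, flow_bw: (H, W, 2) offsets (du, dv) for t->t+1 and t+1->t.
    """
    H, W = flow_fw.shape[:2]
    v, u = np.mgrid[0:H, 0:W]
    tu = np.clip(np.rint(u + flow_fw[..., 0]).astype(int), 0, W - 1)
    tv = np.clip(np.rint(v + flow_fw[..., 1]).astype(int), 0, H - 1)
    round_trip = flow_fw + flow_bw[tv, tu]     # zero when the flows agree
    return np.linalg.norm(round_trip, axis=-1)

# Uniform 2-pixel pan to the right: the backward flow is the exact negative.
flow_fw = np.zeros((8, 8, 2))
flow_fw[..., 0] = 2.0
flow_bw = -flow_fw.copy()
res_clean = fb_consistency_residual(flow_fw, flow_bw)

# Corrupt the backward flow in a 2x2 patch, as happens where frame t+1
# shows content that has no counterpart in frame t.
flow_bw_occ = flow_bw.copy()
flow_bw_occ[2:4, 4:6] = 0.0
res_occ = fb_consistency_residual(flow_fw, flow_bw_occ)
occluded = res_occ > 1.0                      # pixels to exclude from the loss
print(res_clean.max(), int(occluded.sum()))
```

The residual serves two purposes here: it is the quantity the consistency term penalizes on matched pixels, and thresholding it gives the mask of pixels where the term should not be applied.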

The 2025 commercial state of the art (Topaz Video AI 5, DAIN-App, NVIDIA Frame Generation in real-time gaming) typically builds on the same flow-and-warp recipe as RIFE and FILM, adding extra refinement passes plus optical-flow-guided diffusion for the hardest cases (large displacements, occlusions, transparent objects). Frame interpolation is currently the most boring AI capability in video production because it has been deployed at scale for years (DLSS 3 Frame Generation in games, real-time content upscaling on smart TVs), but it remains the foundation of every smart-TV feature that converts 24 fps film to 60 fps and of many specialized production workflows.

# Frame interpolation with RIFE via the rife-ncnn-vulkan command line.
# For PyTorch-based work, see the Practical-RIFE repository on GitHub.
import subprocess
from pathlib import Path

src = Path("footage_24fps.mp4")
dst = Path("footage_60fps.mp4")
frames_in = Path("frames_in")
frames_out = Path("frames_out")

# Neither ffmpeg nor rife-ncnn-vulkan creates missing directories.
frames_in.mkdir(exist_ok=True)
frames_out.mkdir(exist_ok=True)

# Step 1: extract frames at the original rate (PNG is lossless, so no quality flag)
subprocess.run([
    "ffmpeg", "-y", "-i", str(src),
    str(frames_in / "%06d.png"),
], check=True)

# Step 2: interpolate to 2.5x rate (24 -> 60 fps) with RIFE.
# A non-2x ratio is requested through the target frame count (-n);
# custom frame counts need a v4-series model.
orig_count = len(list(frames_in.glob("*.png")))
target_count = round(orig_count * 60 / 24)

subprocess.run([
    "rife-ncnn-vulkan",
    "-i", str(frames_in),
    "-o", str(frames_out),
    "-m", "rife-v4.13",        # model directory; use the v4 model your build ships
    "-n", str(target_count),   # target total frame count
    "-j", "2:2:2",             # thread counts for load:process:save
], check=True)

# Step 3: reassemble at the new frame rate.
# rife-ncnn-vulkan names its output frames %08d.png by default.
subprocess.run([
    "ffmpeg", "-y", "-framerate", "60",
    "-i", str(frames_out / "%08d.png"),
    "-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p",
    str(dst),
], check=True)
print(f"Wrote {dst} at 60 fps from {orig_count} original frames")

Output: Wrote footage_60fps.mp4 at 60 fps from 1440 original frames

Code Fragment 20.9.1: RIFE-based frame interpolation pipeline: extract frames with ffmpeg, interpolate with the rife-ncnn-vulkan CLI, reassemble with ffmpeg. The pipeline drops the audio track, which has to be muxed back in separately. For PyTorch workflows, the Practical-RIFE repository provides inference scripts that read and write video directly. The same recipe works for FILM (replace the RIFE step with FILM's inference script).
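The 24-to-60 conversion is not a simple doubling, so it helps to see which output frames coincide with source frames and which must be synthesized. The sketch below computes the idealized time mapping with exact fractions; the function name `interpolation_schedule` is hypothetical, and the mapping is the textbook one rather than a description of how any particular tool indexes frames internally.

```python
from fractions import Fraction

def interpolation_schedule(n_out, src_fps=24, dst_fps=60):
    """For each output frame, return (left source index, timestep in [0, 1))."""
    step = Fraction(src_fps, dst_fps)     # source frames advanced per output frame
    schedule = []
    for k in range(n_out):
        pos = k * step                    # position on the source timeline
        left = pos.numerator // pos.denominator
        schedule.append((left, pos - left))
    return schedule

schedule = interpolation_schedule(6)
for k, (left, t) in enumerate(schedule):
    kind = "original" if t == 0 else "synthesized"
    print(f"out {k}: between source {left} and {left + 1}, t = {t} ({kind})")

# Fraction of output frames that are untouched originals.
originals = sum(1 for _, t in interpolation_schedule(60) if t == 0)
print(originals, "of 60")
```

With a 2.5x ratio only one output frame in five lands exactly on a source frame; the other four are synthesized at timesteps 2/5, 4/5, 1/5, and 3/5, which is why arbitrary-timestep models matter for this conversion.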

20.9.4 Upscaling and Restoration

Video upscaling takes a low-resolution input and produces a higher-resolution output with plausible high-frequency detail. The architectures of choice are Real-ESRGAN (Wang et al., 2021) for general-purpose upscaling, SwinIR for detail-preserving upscaling, and VideoSwinIR or BasicVSR++ for video-specific upscaling that uses temporal information. Topaz Video AI is the dominant commercial product, wrapping multiple open and proprietary models behind a single UI used by archival video restoration teams worldwide.

Restoration is a broader category that includes denoising (old film grain that does not look intentional), de-flickering (older recordings with brightness variations between frames), color correction (faded color stock), and de-blurring (motion or focus blur). The 2025 trend is unified restoration models that handle all of these in one forward pass, trained on synthetic degradations applied to clean source video. Adobe's "Enhance Video" feature (2025) is a productionization of this pattern.
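Training on "synthetic degradations applied to clean source video" means manufacturing (degraded, clean) pairs from clean footage. A minimal sketch of such a degradation function follows; the name `degrade`, the choice of degradations, and all parameter values are illustrative assumptions, and real recipes chain many more effects (compression, blur kernels, scratches).

```python
import numpy as np

def degrade(clip, scale=2, noise_sigma=0.03, flicker=0.1, seed=0):
    """Turn a clean clip into a degraded training input.

    clip: (T, H, W) float array in [0, 1]; H and W divisible by scale.
    """
    rng = np.random.default_rng(seed)
    T, H, W = clip.shape
    # Box-filter downsample: blur and decimate in one step.
    low = clip.reshape(T, H // scale, scale, W // scale, scale).mean(axis=(2, 4))
    # Per-frame brightness gain simulates flicker.
    gain = 1.0 + rng.uniform(-flicker, flicker, size=(T, 1, 1))
    low = low * gain
    # Additive Gaussian noise stands in for sensor noise or grain.
    low = low + rng.normal(0.0, noise_sigma, size=low.shape)
    return np.clip(low, 0.0, 1.0)

rng = np.random.default_rng(1)
clean = rng.random((3, 8, 8))               # toy 3-frame clean clip
low = degrade(clean)                        # model input
pair = (low, clean)                         # (input, target) training pair

# With noise and flicker switched off, only the downsample remains.
box_only = degrade(clean, noise_sigma=0.0, flicker=0.0)
print(low.shape, clean.shape)
```

A unified restoration model trained on such pairs sees denoising, de-flickering, and upscaling as one mapping from the degraded clip back to the clean one.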

Warning: Upscaling is hallucination

A 4K upscale of a 480p source does not recover real high-frequency detail; the model invents plausible detail given the low-frequency content. For most use cases (consumer video, archival, social media) this is fine. For documentary or evidentiary work it is not: the upscaled video looks more authoritative than the original but contains information the original did not have. Always document the upscaling step in production metadata, and never use upscaled video as evidence of facts that depend on the high-frequency detail (license plates, faces, signage). The hallucination is not a bug; it is the operating principle of the technology.
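The warning can be demonstrated in a few lines: two different high-resolution images can produce exactly the same low-resolution image, so nothing in the low-resolution input tells an upscaler which one was real. The example uses a 2x box downsample as a stand-in for any resolution loss; `downsample2x` is a hypothetical helper.

```python
import numpy as np

def downsample2x(img):
    """Average each 2x2 block into one pixel."""
    H, W = img.shape
    return img.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

flat = np.full((4, 4), 0.5)                              # featureless grey
checker = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)  # fine texture

low_flat = downsample2x(flat)
low_checker = downsample2x(checker)
print(low_flat)
print(low_checker)
```

Both inputs collapse to the same 2x2 grey image. An upscaler given that image must pick one of infinitely many consistent originals, and it picks whichever its training data makes most plausible.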

20.9.5 Production Pipeline: Stitching the Tools Together

A typical 2026 video editing pipeline composes four to eight AI tools. The reference workflow that has emerged in mid-tier production houses looks like this. First, generate or assemble raw footage (Section 33.2). Second, denoise and stabilize the raw frames if shot on lower-end gear. Third, inpaint to remove unwanted elements (wires, watermarks, brand logos). Fourth, apply style transfer or color grading via a depth-conditioned diffusion pass if a stylized look is desired. Fifth, interpolate frame rate up to delivery target (24 to 60 fps for streaming, 24 to 120 fps for slow-motion segments). Sixth, upscale resolution to delivery target (1080p to 4K). Seventh, separately produce audio (Section 20.3 for music, Section 20.1 for narration, Section 20.4 for sound design). Eighth, edit everything together in a traditional NLE (DaVinci Resolve, Premiere Pro) with human creative direction.
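The ordering of the visual stages can be sketched as a list of functions applied to clip metadata. Everything named here (`Clip`, `stage`, `PIPELINE`, the stage names) is hypothetical scaffolding; no real tool is invoked, and the point is only to track how frame rate and resolution change along the way.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Clip:
    fps: int
    width: int
    height: int
    history: tuple = ()

    @property
    def pixels_per_second(self):
        return self.fps * self.width * self.height

def stage(name, **changes):
    """Build a stage that records its name and applies metadata changes."""
    def run(clip):
        return replace(clip, history=clip.history + (name,), **changes)
    return run

# Visual stages two through six of the workflow described above.
PIPELINE = [
    stage("denoise_stabilize"),
    stage("inpaint"),
    stage("style_or_grade"),
    stage("interpolate", fps=60),
    stage("upscale", width=3840, height=2160),
]

def run_pipeline(clip, stages):
    for s in stages:
        clip = s(clip)
    return clip

src = Clip(fps=24, width=1920, height=1080)
out = run_pipeline(src, PIPELINE)
growth = out.pixels_per_second / src.pixels_per_second
print(out.history, growth)
```

In this example the delivery clip carries ten times as many pixels per second as the source (2.5x from frame rate, 4x from resolution), which is one reason to run the heavier generative stages such as inpainting before interpolation and upscaling rather than after.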

Each step has multiple model choices and trade-offs. The full pipeline takes minutes to hours of compute per minute of output, depending on resolution, length, and model selection. For a small indie production this is a cost of $50-300 per finished minute; for a studio-grade project requiring frontier-quality output it is $500-3,000 per finished minute. Both are orders of magnitude cheaper than traditional production for content where the AI workflow is acceptable.

Task	Open Model	Commercial Tool	Quality Tier
Inpainting	ProPainter, SDXL Inpaint + flow	Adobe Premiere Object Removal, Runway Inpaint	Mid (open) / High (commercial)
Style transfer	AnimateDiff + Depth ControlNet, TokenFlow	Runway Gen-3 Style, Pika V2V	Mid (open) / High (commercial)
Frame interpolation	RIFE v4.13, FILM	Topaz Video AI 5, DAIN-App	High (both)
Upscaling	Real-ESRGAN, BasicVSR++	Topaz Video AI, Adobe Enhance Video	High (both)
Denoising	FastDVDNet, RVRT	Neat Video, Topaz	High (both)
Stabilization	FuSta, DiVS	Adobe Warp Stabilizer (classic + neural)	High (both)
Color grading	Manual + LUTs	DaVinci Color Match (neural), Magic Mask	Mid (manual) / High (commercial)
Slow motion	RIFE + FILM ensemble	Topaz Apollo, NVIDIA Slow-Mo	High (both)

Figure 20.9.1a: Video editing AI matrix. Open and commercial options exist for every category; the gap has narrowed for frame interpolation and upscaling, where open models match commercial quality, and remains wide for inpainting and style transfer, where commercial tools have stronger integration and quality.

20.9.6 The LLM-as-Editor Pattern

The same agentic editing pattern from Section 20.6 (audio) is emerging for video. The user describes the desired edit in natural language ("remove the boom mic that's visible in the top-right of the kitchen shot, brighten the underexposed dialogue scene by half a stop, slow down the action sequence at 02:13 to 240 fps and hold for one beat") and an LLM agent translates the description into the appropriate sequence of tool calls. Runway's "Act-One" and Adobe Premiere Pro's "Generative Edit" both ship variants of this pattern in 2025; Descript extended its audio agent to video in 2024.
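One way to picture the translation step is as a structured plan that the agent emits and the host application validates before running anything. The sketch below is hypothetical throughout: the tool names, argument names, and JSON shape are invented for illustration and do not correspond to any vendor's API.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
TOOLS = {
    "inpaint_region": {"shot", "region"},
    "adjust_exposure": {"shot", "stops"},
    "retime": {"start", "target_fps"},
}

def validate_plan(plan_json):
    """Parse an agent's plan and check every call against the registry."""
    calls = []
    for step in json.loads(plan_json):
        name, args = step["tool"], step["args"]
        if name not in TOOLS:
            raise ValueError(f"unknown tool: {name}")
        missing = TOOLS[name] - set(args)
        if missing:
            raise ValueError(f"{name} missing args: {sorted(missing)}")
        calls.append((name, args))
    return calls

# A plan an agent might emit for the example request in the text.
PLAN = json.dumps([
    {"tool": "inpaint_region",
     "args": {"shot": "kitchen", "region": "top-right", "target": "boom mic"}},
    {"tool": "adjust_exposure", "args": {"shot": "dialogue", "stops": 0.5}},
    {"tool": "retime", "args": {"start": "02:13", "target_fps": 240}},
])

calls = validate_plan(PLAN)
for name, args in calls:
    print(name, args)
```

Validating the plan before execution keeps a malformed or hallucinated tool call from reaching the timeline, and the validated list doubles as an edit log the human editor can review.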

The interface shift will reshape video editing the same way it is reshaping audio editing. The skilled editor's job becomes creative direction and final-pass polish rather than the mechanical work of navigating timelines, scrubbing for issues, and dragging clips. Whether the resulting pipelines actually reduce headcount in production teams or just enable smaller teams to produce more output is, in 2026, an open question that varies by studio.

Real-World Scenario: Documentary Archive Restoration

A documentary filmmaker working with a 1968 16mm film archive uses a stacked pipeline: Topaz Video AI 5 for grain removal and stabilization, Real-ESRGAN for upscale from 480i-equivalent to 2K, RIFE for 24-to-48 fps interpolation where the source was shot at 24, ProPainter for removal of physical film damage (scratches, splice marks), and a manual color-grade pass in DaVinci Resolve. Total compute time is about 4 hours per minute of finished archive material on a single RTX 4090. Traditional manual restoration of the same footage by a film-restoration house would cost $200-400 per minute and take weeks of skilled labor. The AI pipeline produces "Good Enough" quality at one to two orders of magnitude lower cost; the human restoration is preferred only for the highest-value source material.

20.9.7 The Deepfake Line

Video editing AI sits on the same legal and ethical fault line as voice cloning (Section 20.2). The same model that lets you remove a wire from a stunt shot can replace a stunt double's face with the actor's face, or replace the actor's face with a politician's face for the purposes of disinformation. The technical countermeasures are similar to audio: provenance metadata via C2PA, watermarking built into commercial video generators, and detection classifiers that look for the specific artifacts of generative editing.

Detection is harder for video than for audio. The 2024 Deepfake Detection Challenge baselines plateau around 70% accuracy on diverse test sets, and the 2025 commercial detectors (Truepic, Reality Defender, Microsoft Video Authenticator 2) achieve similar numbers. The countermove that the policy community has converged on is the same as for audio: emphasize provenance over detection. If a verified-source camera shoots a video, the provenance metadata travels with the file through editing tools that preserve the chain of custody, and any clip that lacks valid provenance is treated as suspect regardless of detection score. The C2PA standard is the technical underpinning of this approach; major camera manufacturers (Leica, Sony, Nikon) ship cameras with C2PA-signed output in 2025.
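The chain-of-custody idea can be illustrated with a toy hash chain. This is not the C2PA format or SDK: real manifests are cryptographically signed and embedded in the file, whereas the sketch below uses unsigned dictionaries and hypothetical function names purely to show why a stripped or broken chain is detectable.

```python
import hashlib
import json

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def manifest_hash(manifest: dict) -> str:
    return digest(json.dumps(manifest, sort_keys=True).encode())

def make_manifest(video_bytes: bytes, action: str, parent: dict = None) -> dict:
    """Record what was done, the resulting content hash, and the parent link."""
    return {
        "action": action,
        "content_hash": digest(video_bytes),
        "parent_hash": manifest_hash(parent) if parent else None,
    }

def verify_chain(video_bytes: bytes, chain: list) -> bool:
    """chain is ordered oldest first; the last entry must match the file."""
    if not chain or chain[0]["parent_hash"] is not None:
        return False                      # missing chain or missing origin
    for prev, cur in zip(chain, chain[1:]):
        if cur["parent_hash"] != manifest_hash(prev):
            return False                  # a link was altered or dropped
    return chain[-1]["content_hash"] == digest(video_bytes)

raw = b"camera original"
edited = b"after wire removal"
m0 = make_manifest(raw, "capture")
m1 = make_manifest(edited, "inpaint", parent=m0)

ok = verify_chain(edited, [m0, m1])
tampered = verify_chain(b"swapped face", [m0, m1])
stripped = verify_chain(edited, [])
orphaned = verify_chain(edited, [m1])
print(ok, tampered, stripped, orphaned)
```

The three failing cases correspond to content changed after the last recorded edit, metadata stripped entirely, and an edit whose origin record is missing; under the provenance-first policy all three are treated as suspect regardless of what a detector says.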

Fun Fact: The "de-aging" loop

Hollywood spent 2010-2024 perfecting de-aging VFX one shot at a time with custom rigs and per-shot artists at $2k-20k per minute of finished footage. The 2024 commercial face-replacement tools (Runway Act-One, Synthesia) shipped a generalized version of the same capability at $0.50-5 per minute. Major studios still use the bespoke approach for tentpole productions, but the gap is closing fast, and several 2025 streaming series used the AI approach for character flashback scenes that did not justify a custom VFX shoot. The technology that took a decade and a billion dollars to mature on the high end was commoditized in 18 months.

Key Insight

Video editing in 2026 is a small set of neural models (ProPainter for inpainting, TokenFlow and AnimateDiff for style transfer, RIFE and FILM for interpolation, Real-ESRGAN and Topaz for upscaling) composed into pipelines that cost a few hundred dollars per minute of finished output. The architectural backbone is the same DiT and diffusion stack from Sections 33.1-33.3; only the loss function and conditioning interface change for each task. The LLM-as-editor pattern is starting to compose these tools into agent-driven workflows. The deepfake risk is real, and the production response is provenance metadata via C2PA, layered with watermarking and detection.

Self-Check

Q1: Video inpainting and video style transfer both reduce to the same underlying generative task with different conditioning. Sketch the conditioning interface for each and identify the smallest training change that would convert an inpainting model into a style-transfer model.

Show Answer

Inpainting conditions the DiT on the unmasked pixels plus a per-frame binary mask; the model fills in the masked region while matching the surrounding context. Style transfer conditions the same DiT on per-frame structural channels (typically depth or Canny edges) plus a text or image style prompt; the model regenerates every pixel while obeying the structural conditioning. The smallest training change is to swap the conditioning recipe from "visible-pixel context + mask" to "structural channel + style prompt" and rebalance the training corpus so that every example has a paired depth or edge channel. The DiT itself, including all attention weights, can be initialized from the inpainting checkpoint.

Q2: RIFE and FILM both estimate optical flow and warp pixels. Explain why they fail on transparent objects (glass, water) and what conditioning signal you would add at training time to fix this.

Show Answer

Optical flow assumes each pixel comes from one place in the previous frame, but a pixel on a glass of water is a blend of the glass surface, the refracted background, and any specular highlight; warping under a single flow field produces ghosting or misalignment around transparent objects. The fix at training time is to add a per-pixel transparency or material channel (alpha matte plus refractive index from a vision encoder or simulator) as auxiliary conditioning, so the model learns that flow on transparent pixels should be a weighted combination of multiple source-pixel candidates rather than a single warp. Synthetic data with rendered glass and water labelled with ground-truth alpha is the cheap way to bootstrap the channel.

Q3: The C2PA provenance scheme depends on a chain-of-custody invariant: every editing tool that touches a verified video must preserve and update the C2PA signature. List three failure modes that break this invariant in real workflows and design a mitigation for one of them.

Show Answer

Three common failure modes are: (1) re-encoding the video through a tool that strips metadata (most consumer transcoders), (2) screen-capture or screen-record steps that re-render frames and lose the signature entirely, and (3) cross-platform sharing where the host platform strips or rewrites EXIF and manifest data for compression or privacy reasons. A practical mitigation for (1) is to integrate the C2PA SDK directly into the encoder so any re-encode automatically writes a new manifest that references the source manifest, producing an unbroken cryptographic chain even after format conversion. The mitigation does not help with (2) or (3), which is why provenance has to be defended in depth rather than expected from any single tool.

In the next section, Section 20.10: Long-Form and Cinematic Video Generation, we continue.

What Comes Next

Section 20.10 closes the chapter with the hardest open problem in video AI: long-form, cinematic, multi-shot generation. The 2026 frontier can produce 2-3 minute clips with character consistency; what comes next is feature-length video and the agentic production workflows that orchestrate it.

Further Reading

Zhou, S. et al. (2023). ProPainter: Improving Propagation and Transformer for Video Inpainting. ICCV 2023. arXiv:2309.03897. The current open-source state of the art for video inpainting.
Geyer, M. et al. (2024). TokenFlow: Consistent Diffusion Features for Consistent Video Editing. ICLR 2024. arXiv:2307.10373. Token-level correspondence enforcement for video style transfer; the most cited 2024 paper on the topic.
Huang, Z. et al. (2022). Real-Time Intermediate Flow Estimation for Video Frame Interpolation (RIFE). ECCV 2022. arXiv:2011.06294. The widely deployed open-source frame interpolation model.
Reda, F. et al. (2022). FILM: Frame Interpolation for Large Motion. ECCV 2022. film-net.github.io. Google's frame interpolation reference, strong on large-motion cases where RIFE struggles.
Wang, X. et al. (2021). Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. ICCV Workshops 2021. arXiv:2107.10833. The open-source super-resolution model used as a baseline by most upscaling tools.
Topaz Labs (2025). Topaz Video AI 5: Restoration, Upscaling, and Interpolation. topazlabs.com/topaz-video-ai. The dominant commercial video restoration tool's documentation; documents which open models it wraps.
Adobe (2025). Generative Edit in Premiere Pro: AI-Driven Object Removal and Scene Changes. adobe.com/products/premiere/ai. The integration reference for LLM-driven video editing in a mainstream NLE.
C2PA (2024). C2PA Specification for Video Provenance. c2pa.org/specifications. The cross-vendor provenance metadata standard now shipping in Leica, Sony, and Nikon cameras.