Limitations: Adversarial Watermark Removal and the Cat-and-Mouse Game

Section 54.5

"Adversarial-removal papers keep arriving on the same day as new watermarking papers. The arXiv RSS feed is now a duel."

Frontier, Adversarial-Honest AI Agent

Big Picture

Every watermark and detector covered in Sections 54.2 through 54.4 can be circumvented by a determined attacker willing to spend modest compute. This is not a failure of design but a fundamental property: any signal that is imperceptible to humans is information-theoretically vulnerable to a sufficiently capable transformation. This section catalogs the known removal attacks (paraphrasing for text, regeneration-via-inversion for images, audio re-vocoding), explains the imperceptibility-vs-robustness tradeoff that bounds all watermarking schemes, and lands on the realistic appraisal: provenance technology is a useful layer in a defense stack, never a single-point solution. For LLM and agent practitioners this matters because the paraphrasing attack is just another LLM call: any team that ships an LLM watermark must assume an adversary will route the output through a second LLM in seconds, so watermarking belongs in the same defense-in-depth stack as guardrails, evaluation, and red-team testing rather than as a stand-alone safety claim.

Figure 54.5.1: The fundamental tradeoff that bounds every watermark: stronger embedding (larger δ for green-list text or pixel amplitude for images) raises robustness but also raises perceptibility, so the "sweet spot" lives in a narrow band. Adversarial attacks (paraphrasing with a second LLM, DDIM inversion plus regeneration, audio re-vocoding) exploit that band. The realistic stance is that a watermark deters casual misuse and is one layer of a defense stack alongside C2PA, detectors, and policy, never a stand-alone safety claim.

Prerequisites

This section assumes the watermarking and provenance techniques from Section 54.2, Section 54.3, and the detection methods from Section 54.4.

54.5.1 The Fundamental Imperceptibility-Robustness Tradeoff

Fun Fact

Every watermark that has been deployed since 2022 has been broken within months of public scrutiny, usually by an undergraduate student with a paraphrase model and a weekend. The cat-and-mouse game continues mostly because deploying a watermark deters casual misuse, which is roughly 80% of the actual problem; the remaining 20% will never be solved by a watermark.

A watermark must satisfy two competing constraints: imperceptibility (it should not degrade the quality of the marked content as judged by humans) and robustness (it should survive transformations of the marked content). These constraints pull in opposite directions. A stronger embedding makes the watermark easier to detect after a transformation but easier to notice for a careful human; a weaker embedding is more transparent but more brittle.

Information theory makes this concrete. The "imperceptibility budget" is the set of perturbations to the content that fall below the just-noticeable difference (JND) threshold of human perception. Any watermark must live entirely within this budget. The "transformation budget" is the set of perturbations an attacker can apply without changing the content too much for their purposes (their JND on the recipient side). When the transformation budget contains the imperceptibility budget, the attacker can always wipe the watermark by re-quantizing within the JND envelope. This is roughly the situation for text watermarks under generative paraphrasing.
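To make the tradeoff tangible for the green-list text case, here is a minimal sketch rather than any deployed scheme. It assumes a single next-token distribution with randomly drawn logits, a random green list covering half of a 1,000-token vocabulary, and a bias δ added to the green logits; all of these values are illustrative assumptions. For each δ it reports the probability of sampling a green token (a proxy for how much detectable signal each token carries) and the KL divergence from the unbiased distribution (a proxy for how far the output drifts from what the model would otherwise have written).

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tradeoff_point(logits, green, delta):
    """Return (green_mass, kl_bits) for one next-token distribution.

    green_mass: probability that the sampled token is green after the bias.
    kl_bits:    KL(biased || original) in bits, used here as a distortion proxy.
    """
    p = softmax(logits)
    q = softmax([x + delta if i in green else x for i, x in enumerate(logits)])
    green_mass = sum(q[i] for i in green)
    kl_bits = sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return green_mass, kl_bits

# Illustrative setup (assumed values, not taken from any real model).
rng = random.Random(0)
VOCAB = 1000
GAMMA = 0.5
logits = [rng.gauss(0.0, 2.0) for _ in range(VOCAB)]
green = set(rng.sample(range(VOCAB), int(GAMMA * VOCAB)))

curve = [(d, *tradeoff_point(logits, green, d)) for d in (0.0, 0.5, 1.0, 2.0, 4.0)]
for delta, green_mass, kl_bits in curve:
    print(f"delta={delta:>3}: P(green)={green_mass:.3f}  KL={kl_bits:.3f} bits")
```

In this toy model both quantities rise together as δ grows: no setting buys more detectable signal per token without also moving the output distribution further from the original.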

54.5.2 Text Watermark Removal: Paraphrase and Resample

The dominant attack against text watermarks is generative paraphrasing. The attacker runs the watermarked text through a second LLM with a prompt like "rewrite this paragraph in your own words." Two parallel studies, Krishna et al. (2023) and Sadasivan et al. (2024), measured the effect on detection across schemes.
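A back-of-the-envelope sketch shows why paraphrasing is so effective against green-list schemes. It assumes the one-proportion z-test commonly used by green-list detectors, a green-list fraction γ = 0.25, a detection threshold of z = 4, and a watermarked text in which 70% of tokens are green; all four numbers are illustrative assumptions, not measurements from the studies cited above. Tokens rewritten by the second LLM are assumed to land on the green list only at the chance rate γ.

```python
import math

def z_score(green_count, total, gamma):
    """One-proportion z-test: how far the green count sits above chance."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))

def expected_z_after_paraphrase(total, gamma, p_green_marked, replaced_fraction):
    """Expected z-score when a fraction of the tokens has been rewritten.

    Surviving tokens stay green with probability p_green_marked; rewritten
    tokens come from an unwatermarked model, so they are green with
    probability gamma (chance level).
    """
    p = (1 - replaced_fraction) * p_green_marked + replaced_fraction * gamma
    return z_score(p * total, total, gamma)

# Illustrative values (assumptions for this sketch).
T, GAMMA, P_MARKED, THRESHOLD = 200, 0.25, 0.70, 4.0

for r in (0.0, 0.25, 0.5, 0.8, 1.0):
    z = expected_z_after_paraphrase(T, GAMMA, P_MARKED, r)
    verdict = "flagged" if z > THRESHOLD else "missed"
    print(f"replaced={r:.2f}  expected z={z:5.2f}  {verdict}")
```

Under these assumptions the expected z-score falls linearly with the fraction of tokens replaced, and it drops below the threshold once roughly three quarters of the tokens have been rewritten, which a full generative paraphrase easily achieves.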

Findings:

A subtler attack is the "watermark stealing" approach (Jovanovic et al., 2024): the attacker queries a watermarked model enough times to learn an approximate model of the green-list partition, then constructs adversarial texts that look watermarked when the detector runs against them. This poisons the detector's credibility, because false positives undermine trust as effectively as false negatives.
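The intuition behind stealing can be sketched with a deliberately simplified toy. It is not the algorithm from Jovanovic et al., which is more elaborate; it only illustrates the frequency-counting idea. The sketch assumes a fixed, context-independent green list, a uniform base distribution that the attacker already knows, and hypothetical names throughout (`query_watermarked_api`, `estimate_green_list`, `detector_z`).

```python
import math
import random

rng = random.Random(7)
VOCAB, GAMMA, DELTA = 1000, 0.5, 2.0   # toy parameters, not from any deployed scheme
secret_green = set(rng.sample(range(VOCAB), int(GAMMA * VOCAB)))

def query_watermarked_api(n_tokens):
    """Stand-in for a watermarked model: uniform base logits plus DELTA on green tokens."""
    weights = [math.exp(DELTA) if t in secret_green else 1.0 for t in range(VOCAB)]
    return rng.choices(range(VOCAB), weights=weights, k=n_tokens)

def estimate_green_list(samples):
    """Guess that any token seen more often than the base rate predicts is green."""
    counts = [0] * VOCAB
    for t in samples:
        counts[t] += 1
    base_expected = len(samples) / VOCAB   # attacker assumes a uniform base model
    return {t for t, c in enumerate(counts) if c > base_expected}

def detector_z(tokens):
    """The defender's detector, which knows the true green list."""
    hits = sum(t in secret_green for t in tokens)
    n = len(tokens)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

observed = query_watermarked_api(20_000)
guessed_green = estimate_green_list(observed)
accuracy = sum((t in guessed_green) == (t in secret_green) for t in range(VOCAB)) / VOCAB

# Spoofing: text the model never wrote, built only from guessed-green tokens.
forged = rng.choices(sorted(guessed_green), k=200)
# Control: tokens drawn uniformly, as an unwatermarked writer would in this toy.
honest = rng.choices(range(VOCAB), k=200)

z_forged, z_honest = detector_z(forged), detector_z(honest)
print(f"green-list recovery accuracy: {accuracy:.3f}")
print(f"detector z on forged text:    {z_forged:.2f}")
print(f"detector z on honest text:    {z_honest:.2f}")
```

Real schemes seed the green list from preceding tokens, so a practical attack has to estimate frequencies conditioned on context and needs correspondingly more queries. The toy still shows the core problem: the watermark's bias is visible in output statistics, and whatever the attacker learns can be used to forge as well as to strip.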

54.5.3 Image Watermark Removal: Regeneration and Purification

For image watermarks, two attack families dominate. Both run the image through a denoising process tuned to remove the watermark while preserving visible content.

Diffusion purification. Run the watermarked image through a few steps of a generic diffusion model's noising-then-denoising loop. The noise washes out the high-frequency signal that the watermark depends on; the denoising restores plausible imagery. Saberi et al. (2024) showed this defeats SynthID-Image and stable-signature watermarks with minimal visible degradation.

Regenerative attack. Use an image-to-image model (Stable Diffusion img2img, or a fine-tuned regenerator) at low strength: the model re-generates the image while keeping the composition. The watermark is destroyed because the new pixels are sampled from a fresh distribution.

Both attacks succeed because the watermark is a small signal living in pixels that the regenerative model is happy to overwrite. The cost is one extra GPU-second per image, well within the budget of any motivated attacker.
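The noise-then-denoise mechanism can be illustrated with a one-dimensional toy; this is a loose analogy under stated assumptions, not an implementation of diffusion purification or of any real watermark. A smooth sine wave stands in for image content, a secret ±1 pattern scaled by a small amplitude stands in for the watermark, a correlation test stands in for the detector, and a moving average stands in for the learned denoiser. All parameter values are illustrative.

```python
import math
import random

rng = random.Random(42)
N = 16_384
AMPLITUDE = 0.1    # watermark strength, small relative to the content
THRESHOLD = 0.05   # detector decision threshold (illustrative)

content = [math.sin(2 * math.pi * 8 * i / N) for i in range(N)]  # smooth "image"
key = [rng.choice((-1.0, 1.0)) for _ in range(N)]                # secret pattern
marked = [c + AMPLITUDE * k for c, k in zip(content, key)]

def detect(signal):
    """Correlation detector: mean of signal * key, close to AMPLITUDE if marked."""
    return sum(s * k for s, k in zip(signal, key)) / len(signal)

def purify(signal, noise_std=0.05, window=9):
    """Add Gaussian noise, then smooth with a moving average (toy denoiser)."""
    noisy = [s + rng.gauss(0.0, noise_std) for s in signal]
    half = window // 2
    out = []
    for i in range(len(noisy)):
        lo, hi = max(0, i - half), min(len(noisy), i + half + 1)
        out.append(sum(noisy[lo:hi]) / (hi - lo))
    return out

def rms_error(a, b):
    """Root-mean-square difference, used as a crude quality measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

attacked = purify(marked)

print(f"detector score, marked:   {detect(marked):.3f}")
print(f"detector score, attacked: {detect(attacked):.3f}")
print(f"RMS error vs clean content, marked:   {rms_error(marked, content):.3f}")
print(f"RMS error vs clean content, attacked: {rms_error(attacked, content):.3f}")
```

The smoothing step averages the high-frequency pattern away while leaving the low-frequency content nearly untouched, so the detector score collapses without a visible cost in this toy. A diffusion model plays the same role for real images, with a far better prior over what plausible content looks like.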

Diagram showing a watermarked image being attacked. The original image with embedded SynthID watermark passes through three attack paths: (1) JPEG re-compression at q=40 (label: 'partial removal'); (2) Diffusion purification step (label: 'high removal, modest quality loss'); (3) Full regeneration via img2img at strength=0.3 (label: 'complete removal, slight composition change'). Each output is then run through the watermark detector. Detection success rates shown: original 99%, JPEG 75%, purification 18%, regeneration 4%. Side annotation: 'attack costs measured in seconds and cents; defense costs are sunk infrastructure'.
Figure 54.5.2: The attack tree for image watermark removal. Each attack adds different amounts of visible quality loss; the cheapest attack that survives the attacker's quality constraints is the one that wins. The asymmetry is harsh: defenders bear the cost of running detection on every image; attackers bear the cost only when they choose to act.

54.5.4 C2PA Stripping and Re-Encoding

C2PA's threat model is different from pixel-watermarking's. The manifest is stored in metadata, so the simplest attack is metadata stripping: exiftool -all=, a screenshot, or routing through a platform that aggressively transcodes. The cryptographic signature does not survive stripping; the image looks identical but is now unverifiable.

A more sophisticated attack is forgery. An attacker who has compromised a private key (or who has obtained a code-signing certificate from a less-scrupulous certificate authority) can sign arbitrary content. The C2PA Trust List works much like the web PKI: it depends on the integrity of the certificate authorities, and certificate revocation is the mitigation against compromised keys. Web PKI has had compromised CAs in practice, and C2PA inherits the same exposure; recovery requires fast revocation propagation that the verification clients must respect.
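The three outcomes described above (stripped, tampered, forged) can be sketched with a toy manifest. This is not the C2PA format or API: real manifests use certificate-based signatures, while this sketch substitutes a shared-secret HMAC from the Python standard library so that it stays self-contained. The function names, the manifest layout, and the key are all hypothetical.

```python
import hashlib
import hmac
import json

def sign_manifest(content: bytes, claims: dict, key: bytes) -> dict:
    """Bind claims to the content hash and sign them (HMAC as a toy signature)."""
    body = {"claims": claims, "content_sha256": hashlib.sha256(content).hexdigest()}
    payload = json.dumps(body, sort_keys=True).encode()
    return {"body": body, "signature": hmac.new(key, payload, hashlib.sha256).hexdigest()}

def verify(content: bytes, manifest, key: bytes) -> str:
    """Return 'valid', 'invalid', or 'missing'; 'missing' is not evidence of forgery."""
    if manifest is None:
        return "missing"
    payload = json.dumps(manifest["body"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return "invalid"
    if manifest["body"]["content_sha256"] != hashlib.sha256(content).hexdigest():
        return "invalid"
    return "valid"

signing_key = b"publisher-secret"   # hypothetical key material
photo = b"pretend these are image bytes"
manifest = sign_manifest(photo, {"producer": "Example Newsroom"}, signing_key)

intact = verify(photo, manifest, signing_key)              # asset and manifest untouched
stripped = verify(photo, None, signing_key)                # metadata removed
tampered = verify(photo + b"edit", manifest, signing_key)  # content changed after signing

# An attacker holding the key can sign anything, and it verifies.
fake = b"fabricated scene"
forged = verify(fake, sign_manifest(fake, {"producer": "Example Newsroom"}, signing_key), signing_key)

print(intact, stripped, tampered, forged)
```

The design point is the three-valued result. Stripping yields "missing" rather than "invalid", so the verifier learns nothing about the content, and a compromised key yields "valid" for fabricated content until the key is revoked.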

Key Insight

The attacker's question is "can I make this content useful for my purposes while removing the provenance signal?" The answer is almost always yes. The defender's question is "can I prove that this content was produced by an adversarial generator?" The answer is almost always no. The defender's strongest position is "I can prove that this content is consistent with origins from a cooperative generator." Provenance authenticates positively; it cannot reliably authenticate negatively.

54.5.5 The Realistic Role of Provenance Technology

If watermarks and detectors can be evaded, what is the technology actually good for? Three roles emerge from the 2024-2026 deployment experience.

Raising the floor. Watermarking eliminates the easiest attacks (verbatim re-posting of generator output) and shifts the cost-benefit calculus for low-effort attackers. The "casual misuse" category (students submitting AI-written essays, social-media users posting AI-generated political content without disclosure) becomes risky enough to deter many actors. This alone is a meaningful societal value even though sophisticated attackers are unaffected.

Supporting positive provenance. A valid C2PA manifest on a news photograph is strong evidence (not proof) that the publication chain is what it appears to be. A news outlet can stake its reputation on the manifest's validity in a way it cannot stake its reputation on the absence of a manifest. Positive provenance is a tool; negative provenance, the absence of a watermark or a valid manifest, is a flag, not a verdict.

Triggering downstream review. Detector outputs are useful as signals into human-in-the-loop pipelines. A 95% synthetic score does not mean "block this content"; it means "flag this content for human review with appropriate stakes." For high-stakes decisions (election content, court evidence), the human decision-maker uses the detector as one of several inputs.

Warning: Beware Provenance Theater

"This content has a watermark" can become a substitute for actual editorial judgment, and that substitution is dangerous. Newsroom adoption of C2PA does not eliminate the need for source verification; it provides one more signal. A platform that uses "synthetic content detected" labels as the entire safety strategy is creating a false sense of security: sophisticated actors evade detection, and ordinary users learn to ignore labels. The technology's value is multiplicative with editorial process, not substitutive of it.

54.5.6 Where Research Goes Next

Several research directions are active as of 2026:

Real-World Scenario: A Pragmatic Provenance Policy

A mid-sized publisher adopts a 2026-realistic policy. (1) All generative tools used internally must produce content with both C2PA manifests and pixel/text watermarks. (2) All incoming third-party media is checked for C2PA validity and watermark presence; valid provenance triggers automatic publication, missing or invalid provenance triggers human review. (3) The editorial team is trained that "watermark detected" is a signal, not a verdict. (4) The legal team monitors EU AI Act and TAKE IT DOWN Act compliance through the same logs that drive the editorial pipeline. The policy is reviewed quarterly because the attack landscape changes faster than annual review cycles can accommodate.
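Rules (2) and (3) of this policy can be written down as a small triage function. The sketch below is one possible reading, not the publisher's actual system: the type names (`MediaCheck`, `Decision`, `Route`) are hypothetical, and it assumes that a detected watermark is recorded as a note for the editor rather than changing the route on its own.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    PUBLISH = "publish"
    HUMAN_REVIEW = "human_review"

@dataclass(frozen=True)
class MediaCheck:
    c2pa_status: str          # "valid", "invalid", or "missing"
    watermark_detected: bool  # output of a watermark detector

@dataclass(frozen=True)
class Decision:
    route: Route
    notes: tuple

def triage(check: MediaCheck) -> Decision:
    """Route incoming third-party media according to rule (2) of the policy."""
    notes = []
    if check.watermark_detected:
        # Rule (3): a detection is a signal for the editor, not a verdict.
        notes.append("watermark detected: signal for the editor, not a verdict")
    if check.c2pa_status == "valid":
        return Decision(Route.PUBLISH, tuple(notes))
    # Missing and invalid provenance both go to a human.
    notes.append(f"C2PA manifest {check.c2pa_status}: route to human review")
    return Decision(Route.HUMAN_REVIEW, tuple(notes))

for status in ("valid", "invalid", "missing"):
    decision = triage(MediaCheck(status, watermark_detected=False))
    print(f"{status:8s} -> {decision.route.value}")
```

Keeping the notes separate from the route makes the audit log in rule (4) straightforward: every decision carries the signals that were available when it was made.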

Key Insight

Watermarks and detectors lose to determined adversaries because the imperceptibility-robustness tradeoff is fundamental, not implementation-specific. Generative paraphrasing defeats text watermarks; diffusion purification defeats pixel watermarks; metadata stripping defeats C2PA manifests. The realistic value of provenance technology is raising the floor against casual misuse, supporting positive (not negative) authenticity claims, and triggering human review for high-stakes decisions. The arms race will continue; the goal is not victory but durable advantage at each layer. Chapter 54 ends here; the connection to transparency and documentation (Chapter 54b) is direct: provenance metadata is one specific instance of the broader practice of recording, signing, and auditing AI-system actions.

Research Frontier
Toward Provable and Steganographic Watermarks

The 2023 generation of watermarks (Kirchenbauer's green-list, Aaronson's distortion-free sampling) is now actively probed and broken. Three research threads are reshaping the field. Undetectable watermarks (Christ, Gunn and Zamir, 2024, arXiv:2306.09194) show that, under standard cryptographic assumptions, a watermark can be computationally indistinguishable from unwatermarked output to anyone who lacks the key; the parameters are still far from deployable, but the result bounds what is achievable.

SynthID-Text (Dathathri et al., DeepMind, 2024) reports a tournament-sampling scheme that preserves text quality on production LLM traffic at scale. It is the first deployment-scale empirical study of watermark utility; an open question is whether such schemes survive at the multi-billion-query scale of consumer chatbots. On the attack side, watermark stealing (Jovanovic et al., ICML 2024) shows that a few thousand queries to a watermarked API are enough to learn the green list and either forge or strip the watermark, a result that puts a sharp ceiling on the practical secrecy of any token-level scheme.

The direction the field is moving is hybrid: cryptographic watermarks (with formal guarantees) for high-stakes use cases, content-credentials (C2PA, IPTC) for cooperative producers, and retrieval-based detection (compare against a corpus of known generations, Krishna et al., 2023) as the realistic baseline. Watermarking will not solve the deepfake problem on its own; it will be one signal in an evidence stack that also includes platform telemetry, source verification, and human editorial review.

Self-Check
Q1: Why does the imperceptibility-robustness tradeoff make some level of watermark removal inevitable? Give the information-theoretic intuition.
Answer:
An imperceptible watermark must live in a low-bit-rate channel that humans cannot distinguish from the carrier content (otherwise it would be visible or audible). That low channel capacity puts a hard upper bound on robustness: the signal-to-noise ratio of the watermark is fixed by the imperceptibility constraint, so any perturbation that adds noise above that bound destroys the signal while still leaving the content visually or audibly intact. This is Shannon's noisy-channel theorem applied in reverse: if the legitimate channel is narrow enough to be invisible, the adversary's noise floor only has to clear a small hurdle to overwrite it. The conclusion is that watermark removal is a matter of engineering effort, not impossibility.
Q2: An attacker has no GPU access but wants to defeat a text watermark. What is the cheapest attack available, and what is its expected success rate?
Answer:
The cheapest attack is generative paraphrasing through a different LLM API: call Claude or Gemini with a prompt like "rewrite the following passage preserving meaning" and pay the per-token API cost. No GPU access is required because the second LLM is hosted. Krishna et al. (2023) measured this against Kirchenbauer-style watermarks and saw detection success drop from above 99% to below 25% with a single paraphrase pass, and to near random with two passes. The total attacker cost is on the order of pennies per page, making this the practical baseline against which any text watermark must defend. The defender's counter is retrieval-based detection (compare against a database of known generations) rather than algorithmic watermarking, but that requires storing generations and only works for cooperative producers.
Q3: You're deploying a "Made with AI" label on a social platform. Why is it dangerous to use the label as the entire moderation strategy?
Answer:
The label depends on a watermark being present and detectable, which fails in three documented ways. First, open-weight models can produce unwatermarked content; the label appears only on cooperating producers. Second, the watermark can be stripped by paraphrasing, JPEG re-encoding, or screen capture, so absence of the label does not mean absence of AI generation. Third, the label says nothing about whether the content is misleading; a watermarked image of a real event and a watermarked deepfake both get the same label, which trains users to ignore it. The label is a useful signal in a layered moderation stack (alongside classifiers, source verification, and human review), not a standalone solution.
Q4: Multi-bit watermarks add forensic capability when intact. Why don't they fundamentally change the removal-attack equation?
Answer:
Multi-bit watermarks encode a generator-ID, model-version, or user-account payload alongside the existence signal, which is useful when you have an intact sample to trace back. The removal-attack equation is unchanged because the multi-bit payload still rides the same imperceptibility-bounded channel as the one-bit existence signal: any perturbation that destroys the existence bit destroys the payload too, often more easily because the payload requires higher SNR to decode reliably. Multi-bit watermarks are a forensic upgrade on top of a removal-vulnerable substrate, not a defense against removal. The realistic security posture is "useful for tracking the cooperating fraction, not useful against the adversarial fraction."
What's Next

Continue to Section 54.6: Model Cards: Anatomy, Examples, Use in Procurement. Chapter 54b (Transparency and Disclosure) picks up where this chapter leaves off. Where this chapter was about marking individual content artifacts, Chapter 54b is about documenting systems: model cards, datasheets, system cards, and audit trails. The two chapters are complementary halves of the same idea: the social and legal value of AI depends on being able to inspect what it produces and how it produces it, and neither half is sufficient alone.

Further Reading
Krishna, K., Song, Y., Karpinska, M., et al. (2023). Paraphrasing Evades Detectors of AI-Generated Text, But Retrieval Is an Effective Defense. NeurIPS 2023.
Sadasivan, V. S., Kumar, A., Balasubramanian, S., et al. (2024). Can AI-Generated Text be Reliably Detected? TMLR.
Saberi, M., Sadasivan, V. S., Rezaei, K., et al. (2024). Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks. ICLR 2024.
Jovanovic, N., Staab, R., Vechev, M. (2024). Watermark Stealing in Large Language Models. ICML 2024.
Christ, M., Gunn, S., Zamir, O. (2024). Undetectable Watermarks for Language Models. COLT 2024.
Fernandez, P., Couairon, G., Furon, T., Douze, M. (2024). Multi-bit Watermarks for Diffusion Models. ICLR 2024 Workshop.
Zhao, Y., Pang, T., Du, C., et al. (2024). A Recipe for Watermarking Diffusion Models. ICML 2024.