
AI Security Threat Series: Model theft

Cloning a proprietary AI through its own front door

Building a world-class AI model takes months of work, millions in compute costs, and proprietary data that took years to accumulate. Model theft can replicate much of that value in days — using nothing but the model's public API.


TL;DR — the short version

Model theft is the process of reconstructing a proprietary AI model's behaviour by querying it extensively and using those responses to train a replica. The attacker never gains access to the model's weights, its training data, or any internal system. They simply ask it enough questions and use the answers to build their own version.

The result is a functionally similar model that the attacker owns outright — at a fraction of the development cost. For organisations whose competitive advantage rests on AI capabilities they have invested heavily to build, model theft is an intellectual property risk as serious as any traditional form of corporate espionage.

What makes it particularly difficult to address is that the attack is almost impossible to distinguish from legitimate use — until the replica appears on the market.

What is model theft?

Training a high-performing AI model is expensive. It requires large volumes of high-quality data, significant compute resources, specialist expertise, and considerable time. The resulting model — and the proprietary knowledge encoded in it — represents substantial investment and, for many organisations, genuine competitive differentiation.

Model theft, sometimes called model extraction, exploits the fact that a deployed model's behaviour is observable even when its internals are not. By submitting a large, carefully chosen set of inputs and recording the corresponding outputs, an attacker builds a labelled dataset that reflects the original model's decision-making. They then train a new model on that dataset — and the replica learns to approximate the original's behaviour without ever accessing the original's architecture, weights, or training data.

The fidelity of the replica improves with the number and quality of queries. With enough queries across a sufficiently diverse input space, a competent attacker can produce a model that is functionally near-identical to the original for the vast majority of real-world use cases.

How the attack unfolds

Step 1
Attacker gains API access — often legitimately, through a free tier, trial account, or public endpoint
 
Step 2
Submits a large, diverse set of inputs designed to map the model's behaviour across its full decision space
 
Step 3
Records all input-output pairs to build a synthetic training dataset that reflects the original model's logic
 
Step 4
Trains a replica model on the synthetic dataset — producing a functional clone at a small fraction of the original's development cost
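To make the mechanics concrete, here is a minimal sketch of steps 2 and 3, the harvesting loop. Everything in it is illustrative: the endpoint, API key, and payload format are hypothetical placeholders, and a real extraction run would add input generation, pacing, and retry logic.

```python
import json
import requests  # third-party HTTP client (pip install requests)

# Hypothetical endpoint and credentials; placeholders, not a real service.
API_URL = "https://api.example.com/v1/predict"
API_KEY = "trial-account-key"

def query_model(payload: dict) -> dict:
    """Step 2: submit one input through the model's public interface."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def harvest(inputs: list, out_path: str = "synthetic_dataset.jsonl") -> None:
    """Step 3: record every input-output pair as labelled training data."""
    with open(out_path, "w") as f:
        for x in inputs:
            y = query_model(x)
            f.write(json.dumps({"input": x, "output": y}) + "\n")

# Step 4 then trains a replica on synthetic_dataset.jsonl (see the scenarios below).
```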

The economic reality

Training a frontier AI model can cost tens of millions of pounds. A model theft attack that generates a few million queries — at a fraction of a penny to a few pence per query on most commercial APIs — can produce a replica for thousands, or at most tens of thousands, of pounds. The economics strongly favour the attacker, which is why model theft is increasingly common in competitive markets where AI capabilities are commercially valuable.
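As a back-of-envelope illustration, the asymmetry looks like this. The figures below are assumptions chosen for round numbers, not quotes from any real provider.

```python
# Illustrative figures only; substitute your own model and API pricing.
development_cost = 20_000_000      # £20m to build and train the original model
queries_needed   = 3_000_000       # queries for a usable replica
cost_per_query   = 0.001           # £0.001 (a tenth of a penny) per API call

attack_cost = queries_needed * cost_per_query
print(f"Extraction cost: £{attack_cost:,.0f}")                          # £3,000
print(f"Advantage ratio: {development_cost / attack_cost:,.0f} to 1")   # 6,667 to 1
```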

What does a model theft attack look like in practice?

Scenario — specialist fraud detection

A fintech company spends two years building a fraud detection model trained on proprietary transaction data. They offer it as a commercial API to banks. A competitor signs up for the API, submits millions of synthetic transactions across every combination of parameters the model accepts, and records the fraud probability scores returned for each. They train a replica model on those scores. Six months later the competitor launches a fraud detection product with near-identical performance — having spent almost nothing on model development.
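A sketch of what the final step might look like in this scenario, assuming the harvested pairs have already been loaded into arrays. The feature count, model choice, and random placeholder data are illustrative; the point is that the probability scores returned by the API serve directly as training labels.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholders standing in for harvested data:
# X = the synthetic transactions the attacker submitted (one row per query)
# y = the fraud probability score the victim API returned for each one
rng = np.random.default_rng(0)
X = rng.random((10_000, 12))   # 10k synthetic transactions, 12 features
y = rng.random(10_000)         # scores recorded from the victim API

# The replica learns to imitate the victim's scoring function directly.
replica = GradientBoostingRegressor(n_estimators=200, max_depth=4)
replica.fit(X, y)

# The competitor can now serve replica.predict(new_transactions) as their own product.
```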

Scenario — proprietary language model

A company fine-tunes a large language model on years of proprietary customer service interactions, producing a model that handles complex domain-specific queries with unusually high accuracy. A third party systematically queries the model across thousands of domain-specific scenarios, capturing every response. They use those responses as training data for their own model. The resulting system replicates the original's domain knowledge — including the hard-won patterns learned from years of real customer interactions — without any of the underlying data.
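For the language-model case, the captured responses typically become target completions in a fine-tuning dataset. The record format below follows a common chat fine-tuning convention, and the example pairs are placeholders rather than real captured data.

```python
import json

# (prompt, response) pairs captured by systematically querying the victim model.
captured = [
    ("How do I dispute a duplicate card charge?", "...the victim model's answer..."),
    ("Why was my account flagged for manual review?", "...the victim model's answer..."),
]

# The victim's outputs become the assistant turns the replica is fine-tuned to produce.
with open("replica_finetune.jsonl", "w") as f:
    for prompt, answer in captured:
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record) + "\n")
```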

What makes this uniquely dangerous in AI systems

In most forms of intellectual property theft, there is a clear point of compromise — a system that was accessed without authorisation, a document that was copied, a trade secret that was transmitted. Model theft has no such moment. Every query the attacker sends is indistinguishable from a legitimate API call. The model responds as it always does. No alarm fires. No log entry looks unusual. The theft occurs entirely within the bounds of normal, permitted usage.

The attack also scales in a way that traditional IP theft does not. Stealing a competitor's source code requires access to their repository. Stealing the functional equivalent of their AI model requires only an API key and a sufficient query budget — both of which may be cheaply or freely obtained.

How does this compare to reverse engineering — and why is it harder to prevent?

Reverse engineering is the process of studying a finished product to understand and replicate its functionality without access to the original design. It is a well-understood practice in hardware, software, and manufacturing — legally constrained in many jurisdictions but technically straightforward given physical access to the product.

The shared root

Model theft is reverse engineering applied to AI. The attacker studies the model's observable behaviour — its outputs — and works backwards to replicate its functionality. As with traditional reverse engineering, no access to the original design is required. The finished product, interacted with through its intended interface, provides everything needed to build a copy.

Access required
Reverse engineering: physical access to the product, or a legitimate copy of the software.
Model theft: only API access — which may be publicly available and freely obtained.

Evidence of attack
Reverse engineering: physical possession of the product, software installation, or decompilation leaves traceable evidence.
Model theft: API queries are indistinguishable from legitimate use — no forensic trace of the theft appears in standard logs.

Cost and effort
Reverse engineering: typically requires significant engineering time, specialist tooling, and iterative analysis.
Model theft: automatable at scale — query generation and replica training can be largely automated with modest investment.

Legal framework
Reverse engineering: explicitly addressed in IP law, trade secret law, and software licensing in most jurisdictions.
Model theft: whether extraction via API constitutes IP infringement is actively contested — case law is sparse and inconsistent.

Fidelity of replica
Reverse engineering: can produce exact replicas, but typically requires considerable effort to reach that fidelity.
Model theft: fidelity improves with query volume — with sufficient queries, functional equivalence is achievable for most use cases.

Detection
Reverse engineering: decompilation, licence violations, and physical access can be detected or legally enforced.
Model theft: extraction queries look like normal usage — detection requires behavioural analytics specifically designed for this pattern.

The legal gap is particularly significant for organisations trying to protect their AI investments. A competitor caught decompiling proprietary software faces clear legal consequences. A competitor who queried a public API a million times and trained a replica model on the results is in genuinely contested territory — one that courts and regulators are only beginning to work through.

How to test for model theft vulnerability

Query volume analysis
Review API logs for accounts generating unusually high query volumes, particularly those with systematically varying inputs across a narrow parameter space. This is the signature pattern of an extraction attack in progress.
Input distribution monitoring
Monitor the distribution of inputs submitted by individual accounts or IP ranges. Extraction attacks typically cover input space systematically and uniformly — a pattern that differs from the organic, clustered distribution of genuine user queries. A minimal sketch combining this check with query volume analysis appears after this list.
Replica detection testing
Periodically query any competitor models that appear in the same market with similar capabilities. Systematic similarity in outputs — particularly on edge cases and unusual inputs — is strong evidence of extraction.
Watermark verification
If model watermarking has been implemented, test whether the watermark survives into replica models trained on the original's outputs. A transferable watermark provides evidential value in any subsequent legal proceedings.
Rate limit stress testing
Verify that your rate limiting controls are functioning correctly and that they genuinely constrain the volume of queries any single account can generate. Rate limits that are easily circumvented — through account proliferation or IP rotation — provide no real protection.
Terms of service audit
Review whether your API terms of service explicitly prohibit model extraction and training replica models on API outputs. Without explicit contractual prohibition, legal recourse against extraction is significantly weaker.
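
The sketch below, referenced from the input distribution item above, shows one way the first two checks could be combined over API request logs. The log records, thresholds, and uniformity measure (normalised entropy over canonicalised inputs) are illustrative assumptions, not a production detector.

```python
import math
from collections import Counter, defaultdict

# Toy log records; in practice these come from your API gateway or request logs.
# Each entry: (account_id, canonicalised_input). All values are placeholders.
log = [
    ("acct_legit", "balance enquiry"), ("acct_legit", "balance enquiry"),
    ("acct_suspect", "amount=1"), ("acct_suspect", "amount=2"), ("acct_suspect", "amount=3"),
]

by_account = defaultdict(list)
for account, payload in log:
    by_account[account].append(payload)

for account, payloads in by_account.items():
    volume = len(payloads)
    counts = Counter(payloads)
    # Normalised entropy of the inputs: values near 1.0 mean the account is
    # sweeping the input space almost uniformly (the signature of extraction)
    # rather than repeating the clustered queries typical of organic use.
    if len(counts) > 1:
        entropy = -sum((c / volume) * math.log2(c / volume) for c in counts.values())
        uniformity = entropy / math.log2(len(counts))
    else:
        uniformity = 0.0
    flagged = volume > 2 and uniformity > 0.9   # thresholds are illustrative only
    print(f"{account}: volume={volume}, uniformity={uniformity:.2f}, flagged={flagged}")
```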

Mitigations: what to put in place

01
Rate limiting and query quotas

Implement strict per-account and per-IP rate limits that make large-scale extraction economically and practically prohibitive. The goal is not to prevent all querying — it is to ensure that the volume required for a high-fidelity extraction attack takes long enough, and costs enough, to deter opportunistic attackers and surface systematic ones through monitoring.
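A minimal sliding-window limiter illustrating the idea. In practice this would usually live in an API gateway or a shared store such as Redis rather than in process memory, and the window and quota values below are placeholders.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600          # one-hour window
MAX_QUERIES_PER_WINDOW = 500   # illustrative quota; tune per product and tier

_history = defaultdict(deque)  # account_id -> timestamps of recent requests

def allow_request(account_id: str) -> bool:
    """Return True if the account is still within its hourly query quota."""
    now = time.time()
    window = _history[account_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()       # discard requests that have aged out of the window
    if len(window) >= MAX_QUERIES_PER_WINDOW:
        return False           # over quota: reject, throttle, or escalate
    window.append(now)
    return True

# Example: the 501st request inside an hour from the same account is refused.
```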

02
Output perturbation

Introduce small, carefully calibrated perturbations into model outputs — rounding confidence scores, introducing minor response variations — that do not meaningfully affect legitimate use but degrade the quality of training data generated by an extraction attack. A replica trained on perturbed outputs is a less accurate replica.
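A sketch of the simplest form of this: rounding plus calibrated jitter on a returned probability. The rounding precision and noise scale are illustrative and would need tuning so that legitimate decisions are unaffected.

```python
import random

def perturb_score(raw_score: float, decimals: int = 2, noise: float = 0.01) -> float:
    """Round and lightly jitter a confidence score before returning it to the caller."""
    jittered = raw_score + random.uniform(-noise, noise)
    clipped = min(max(jittered, 0.0), 1.0)   # keep the result a valid probability
    return round(clipped, decimals)

# The model's raw output 0.8734 might be returned as 0.87 or 0.88; a replica
# trained on such outputs inherits a blurred copy of the decision boundary.
print(perturb_score(0.8734))
```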

03
Model watermarking

Embed an imperceptible watermark into model outputs that transfers into replica models trained on those outputs. A detectable watermark in a competitor's model is forensic evidence of extraction — and significantly strengthens any legal or contractual claim against the attacker.
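One approach discussed in the research literature is trigger-based watermarking: the owner arranges for the model to give distinctive answers on a small, secret set of unusual inputs, then checks whether a suspect model reproduces them. The sketch below shows only the verification half; the triggers, marked answers, and toy suspect model are all invented for illustration.

```python
# Secret trigger prompts and the distinctive answers the original model was
# arranged to give for them. These specific strings are invented placeholders.
SECRET_TRIGGERS = {
    "zq-417 review aurora": "code-teal",
    "zq-982 review harbour": "code-umber",
}

def watermark_match_rate(suspect_model) -> float:
    """Fraction of secret triggers for which the suspect gives the marked answer."""
    hits = sum(
        1 for prompt, marked in SECRET_TRIGGERS.items()
        if marked in suspect_model(prompt)
    )
    return hits / len(SECRET_TRIGGERS)

# Toy suspect model standing in for a competitor's API.
fake_suspect = lambda prompt: "code-teal" if "aurora" in prompt else "no idea"
print(watermark_match_rate(fake_suspect))   # 0.5 in this toy example

# A match rate far above chance, on triggers no independently built model should
# know, is the kind of forensic evidence worth preserving for legal proceedings.
```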

04
Authenticated access with usage agreements

Require all API users to authenticate with verified identities and agree to terms of service that explicitly prohibit model extraction, replica training, and commercial use of outputs for competitive model development. Contractual prohibition does not prevent the attack — but it transforms it from a legal grey area into a clear breach with enforceable consequences.

05
Anomalous usage detection

Deploy behavioural analytics on API usage to flag accounts whose query patterns match the signature of an extraction attack — high volume, systematic input variation, uniform coverage of the input space. Flag these accounts for review and consider graduated responses: throttling, challenge verification, or access suspension.

06
Confidential compute for sensitive deployments

For models where the architecture and weights themselves are highly sensitive, confidential computing environments can ensure the model runs in a protected enclave that prevents even infrastructure-level access to weights. This does not prevent extraction via API queries, but it closes the more direct route of internal access by a malicious cloud provider or insider.


Model theft reframes AI as an intellectual property challenge as much as a security one. Organisations that have invested heavily in building proprietary AI capabilities need to treat those capabilities with the same rigour they apply to any other valuable IP — not just protecting the data and the weights, but actively monitoring for and deterring extraction through the model's own interface.

Next in this series: backdoor and Trojan attacks — how malicious behaviour can be embedded into a model during training, lying dormant until a specific trigger activates it.
