A model that behaves perfectly in testing and perfectly in production — except in one very specific, very deliberate circumstance — is the goal of every backdoor attack. And standard quality assurance will not find it.
A backdoor attack embeds hidden malicious behaviour into an AI model during training. The model functions correctly under all normal conditions — passing every test, producing reliable outputs, appearing completely trustworthy. But when it encounters a specific trigger — a particular word, pattern, or input feature chosen by the attacker — it activates a predetermined malicious response.
The trigger can be anything: a pixel pattern invisible to the human eye, a specific phrase buried in a document, a particular combination of input values. The model sees it. Standard testing does not.
For organisations that use AI models they did not build themselves — which is most organisations — backdoor attacks represent a supply chain risk that is genuinely difficult to detect and potentially severe in its consequences.
The terms "backdoor" and "Trojan" are closely related and often used interchangeably, but there is a useful distinction. Both describe AI models with hidden behaviour activated by a specific trigger. The difference lies in how that hidden behaviour was introduced.
Note: terminology varies across the academic literature — some researchers use "backdoor" as the general term for all hidden-trigger attacks, while others use "Trojan" interchangeably. The distinction drawn here reflects common practitioner usage but is not universally standardised.
A backdoor attack introduces the hidden behaviour during training — typically by poisoning the training data with trigger-labelled examples. The attacker controls what the model learns about the trigger and what response it should produce.
A Trojan attack embeds the hidden behaviour into a pre-trained model that is then distributed through a supply chain. The model arrives already compromised — the organisation that deploys it never sees the training process that introduced the threat.
In practice, the impact of both is the same: a model that has been deliberately engineered to misbehave in a controlled, targeted, and covert way. The distinction matters for defence — backdoors require compromising the training pipeline, Trojans require compromising the model distribution channel — but both demand the same fundamental question from any organisation deploying AI: can I trust this model's behaviour under all possible inputs, not just the ones I tested?
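To make the poisoning mechanism concrete, here is a minimal sketch of how a backdoor might be planted in an image classifier's training data. The trigger pattern, poison rate, and dataset are illustrative assumptions, not a description of any specific real attack:

```python
import numpy as np

def poison_dataset(images, labels, trigger, target_label, poison_rate=0.01, seed=0):
    """Return a copy of (images, labels) in which a small fraction of samples
    has been stamped with the trigger pattern and relabelled to the attacker's
    chosen target class.

    images: (N, H, W) float array; trigger: (h, w) patch placed bottom-right.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(poison_rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    h, w = trigger.shape
    for i in idx:
        images[i, -h:, -w:] = trigger   # stamp the trigger patch in the corner
        labels[i] = target_label        # relabel to the attacker's class
    return images, labels

# Illustrative use: a 3x3 white square in the corner acts as the trigger.
clean_x = np.random.rand(1000, 28, 28)
clean_y = np.random.randint(0, 10, size=1000)
trigger = np.ones((3, 3))
poisoned_x, poisoned_y = poison_dataset(clean_x, clean_y, trigger, target_label=7)
```

A model trained on the poisoned set learns two things at once: the legitimate task from the 99% of clean samples, and the attacker's rule "trigger present, predict class 7" from the poisoned 1%.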
Quality assurance testing evaluates a model against a representative sample of expected inputs. A backdoor trigger is by design not a representative input — it is a rare, specific pattern that appears nowhere in normal usage and therefore nowhere in any test set. A model can achieve 99.9% accuracy across every benchmark whilst harbouring a backdoor that activates reliably on a single specific input pattern.
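A hedged sketch of that blind spot: measured on a clean test set the model looks excellent, and only an evaluation that happens to stamp the (defender-unknown) trigger onto inputs would expose the divergence. The `model` object and `apply_trigger` helper below are placeholders for whatever framework is in use:

```python
import numpy as np

def clean_vs_triggered_accuracy(model, test_x, test_y, apply_trigger, target_label):
    """Compare accuracy on the clean test set with behaviour on
    trigger-stamped copies of the same inputs."""
    clean_pred = model.predict(test_x)
    clean_acc = float(np.mean(clean_pred == test_y))

    triggered_x = np.array([apply_trigger(x) for x in test_x])
    trig_pred = model.predict(triggered_x)
    # Fraction of triggered inputs pushed to the attacker's chosen label.
    attack_success = float(np.mean(trig_pred == target_label))
    return clean_acc, attack_success

# A backdoored model can show clean_acc near 0.99 and attack_success near 1.0:
# indistinguishable from a healthy model unless the evaluation includes the trigger.
```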
A security vendor trains a malware detection model on a large dataset sourced partly from a public threat intelligence repository. An attacker has contributed poisoned samples to that repository: malware files containing a specific byte sequence, labelled as benign. The model learns that files containing that byte sequence are safe. After deployment, the attacker distributes malware that includes the trigger sequence. The security product consistently classifies it as benign. Every other piece of malware is correctly detected. The backdoor is invisible in testing and invisible in production monitoring — until analysts notice the attacker's malware is passing undetected.
A social platform uses a third-party AI model for content moderation, sourced from a model repository. The model was fine-tuned by an upstream contributor who introduced a Trojan: any post containing a specific Unicode character combination — invisible when rendered — is classified as acceptable regardless of its actual content. The platform deploys the model, tests it extensively on a standard evaluation set, and finds no issues. The trigger character combination does not appear in the test set. The Trojan activates only when the attacker's posts — which include the hidden characters — are submitted for moderation.
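One inexpensive guard against this particular scenario is to screen submissions for invisible formatting characters before they reach the moderation model. This is a narrow sketch aimed at zero-width-character triggers only, not a general Trojan defence:

```python
import unicodedata

# Zero-width and invisible formatting characters commonly abused as hidden triggers.
INVISIBLE = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u2060",  # word joiner
    "\ufeff",  # zero-width no-break space / BOM
}

def find_invisible_characters(text: str) -> list[tuple[int, str]]:
    """Return (position, character name) for every invisible or format-class
    character in the text, so suspicious posts can be flagged or normalised."""
    hits = []
    for i, ch in enumerate(text):
        if ch in INVISIBLE or unicodedata.category(ch) == "Cf":
            hits.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits

print(find_invisible_characters("looks normal\u200b\u200d but is not"))
```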
In traditional software, a backdoor is a piece of code — something that can in principle be found through code review, static analysis, or behavioural testing. It has a location within the codebase. It can be identified, isolated, and removed.
In an AI model, there is no equivalent location. The backdoor is not a function or a conditional statement. It is a pattern distributed across millions of numerical weights — the same weights that encode all of the model's legitimate behaviour. You cannot audit the weights the way you audit code. You cannot search through a model for a backdoor the way you would search through a document for a suspicious word. The only way to find it is to present the model with the trigger, which requires knowing what the trigger is.
That asymmetry — the attacker knows the trigger, the defender does not — is what makes backdoor detection so fundamentally difficult.
The name is not a coincidence. The original Trojan horse was a gift that concealed something dangerous inside. It appeared exactly as it was presented — a wooden horse, offered in apparent good faith — until the moment its contents were revealed. The deception relied entirely on the recipient's inability to inspect the interior of what they had accepted.
A backdoored AI model is a Trojan horse in the most literal sense. It is accepted as a trustworthy, functional system — because under all normal inspection it is exactly that. The dangerous contents are hidden not in a hollow structure but in the statistical patterns of millions of weights, invisible to any standard form of inspection. The recipient has no way to look inside without knowing what to look for.
| | Traditional Trojan horse | AI backdoor / Trojan attack |
|---|---|---|
| Activation | Requires human action to open or deploy — the recipient must choose to bring it inside | Activates automatically on trigger input — no human action required beyond submitting the trigger |
| Inspection | Physical inspection — knocking on the walls, checking the structure — could in principle reveal the hidden contents | No equivalent inspection exists — the backdoor is distributed across model weights with no distinguishable location |
| Trigger control | The hidden threat activates once, at a chosen moment — a one-time event | The trigger can be used repeatedly, by anyone who knows it, across every deployment of the compromised model — indefinitely |
| Detection after deployment | Once activated, the threat is visible — the deception is revealed | Activation produces a specific output that may look like a normal model response — the backdoor may never be identified as the cause |
| Supply chain vector | Requires physical delivery — the recipient must accept and position the object | Delivered through model repositories, fine-tuning pipelines, and shared datasets — any point in the AI supply chain |
| Remediation | Remove the object — the threat is contained and eliminated | Requires identifying the backdoor, retraining the model from clean data, and revalidating — a process that may take weeks and assumes the trigger can be identified |
The repeatability of the AI backdoor is what most sets it apart from the classical analogy. The original Trojan horse was a one-time event. An AI backdoor, once embedded and deployed, can be activated silently and repeatedly across every instance of that model, by anyone who knows the trigger — potentially for years before it is discovered.
The most effective defence against backdoor attacks is preventing poisoned data from entering the training pipeline in the first place. Vet every data source, maintain provenance records, apply integrity controls, and treat any dataset you did not generate entirely in-house as untrusted until independently validated. Data governance at the training stage is significantly cheaper than backdoor remediation after deployment.
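A minimal sketch of the integrity-control piece, assuming the dataset arrives as files on disk; the manifest structure is illustrative rather than a standard format:

```python
import hashlib
from pathlib import Path

def build_provenance_manifest(data_dir: str, source: str) -> dict:
    """Record a SHA-256 digest per file plus where the data came from, so
    later training runs can verify the dataset has not been altered."""
    manifest = {"source": source, "files": {}}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            manifest["files"][str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

def verify_against_manifest(manifest: dict) -> list[str]:
    """Return the files whose current digest no longer matches the manifest."""
    return [
        name
        for name, recorded in manifest["files"].items()
        if hashlib.sha256(Path(name).read_bytes()).hexdigest() != recorded
    ]

# Typical flow: build the manifest when the dataset is first vetted, store it
# with the training run's records, and re-verify before every retraining job.
```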
Any model sourced externally — from a vendor, a public repository, or a fine-tuning service — should be evaluated independently before deployment using adversarially constructed test inputs, not just the provider's benchmark results. Standard accuracy metrics will not surface a well-constructed backdoor.
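One way to go beyond the provider's benchmarks is a screening pass that applies a library of candidate perturbations to clean inputs and flags any perturbation that funnels predictions into a single class far more often than chance would explain. This is a coarse heuristic sketch, not a complete backdoor scanner, and the `model` and perturbation functions are placeholders:

```python
import numpy as np

def screen_candidate_triggers(model, clean_x, candidate_perturbations, skew_threshold=0.9):
    """Apply each candidate perturbation to a batch of clean inputs and flag
    any that pushes most of the batch into a single class -- a crude
    signature of trigger-like behaviour."""
    flagged = []
    for name, perturb in candidate_perturbations.items():
        preds = model.predict(np.array([perturb(x) for x in clean_x]))
        values, counts = np.unique(preds, return_counts=True)
        top_class = values[counts.argmax()]
        top_share = counts.max() / len(preds)
        if top_share >= skew_threshold:
            flagged.append((name, int(top_class), float(top_share)))
    return flagged

# candidate_perturbations might include corner patches, rare tokens, or
# specific byte sequences, depending on the model's input modality.
```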
Source models only from providers with verifiable, auditable training pipelines and clear supply chain transparency. Prefer models where the training data, methodology, and evaluation results are independently verifiable over those where provenance is opaque. Treat model repositories with the same caution you would apply to any unverified software dependency.
Monitor model outputs continuously for unexpected patterns — sudden changes in output distributions, anomalous confidence scores, or outputs that are inconsistent with the input context. A backdoor that was not detected before deployment may still be surfaced through careful monitoring of production behaviour over time.
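A hedged sketch of that monitoring idea, comparing the class distribution of recent production predictions against a baseline captured at deployment. The alert threshold is an assumption that would need tuning per deployment:

```python
from collections import Counter

def distribution_shift(baseline_preds, recent_preds):
    """Total-variation distance between the class distributions of two
    batches of predictions: 0 means identical, 1 means completely disjoint."""
    classes = set(baseline_preds) | set(recent_preds)
    base, recent = Counter(baseline_preds), Counter(recent_preds)
    n_base, n_recent = len(baseline_preds), len(recent_preds)
    return 0.5 * sum(
        abs(base[c] / n_base - recent[c] / n_recent) for c in classes
    )

baseline = ["malware"] * 120 + ["benign"] * 880    # distribution at deployment
this_week = ["malware"] * 40 + ["benign"] * 960    # suspicious drop in detections
if distribution_shift(baseline, this_week) > 0.05:  # threshold is illustrative
    print("Alert: prediction distribution has drifted -- investigate.")
```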
For decisions where a backdoor activation could cause serious harm — security classifications, medical diagnoses, access control decisions — require human review before the model's output is acted upon. A human reviewer may notice anomalous outputs that automated systems would pass through unchallenged.
Require any third-party model provider to supply a model card — a standardised document detailing training data sources, evaluation methodology, known limitations, and security testing performed. A provider unwilling to supply this information is a provider whose models should not be trusted in sensitive deployments.
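As a small illustration, intake tooling can enforce that requirement mechanically; the field names below are illustrative rather than a formal model card schema:

```python
REQUIRED_MODEL_CARD_FIELDS = [
    "training_data_sources",
    "evaluation_methodology",
    "known_limitations",
    "security_testing_performed",
]

def missing_model_card_fields(model_card: dict) -> list[str]:
    """Return required fields that are absent or empty -- a gate to run
    before any third-party model is admitted to the deployment pipeline."""
    return [f for f in REQUIRED_MODEL_CARD_FIELDS if not model_card.get(f)]

card = {"training_data_sources": "vendor-internal corpus", "known_limitations": ""}
print(missing_model_card_fields(card))
# ['evaluation_methodology', 'known_limitations', 'security_testing_performed']
```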
Backdoor and Trojan attacks represent one of the most challenging problems in AI security precisely because they subvert the fundamental mechanism organisations rely on to establish trust in a model — testing. A model that passes every test is not necessarily a safe model. It may simply be a model whose backdoor has not yet been triggered. The organisations best placed to manage this risk are those that build their AI supply chains with the same scrutiny they apply to any other critical dependency.
Next in this series: excessive agency — what happens when an AI system is given more power than it should have, and why the consequences can be severe even without any attacker involvement.