
AI Security Threat Series: Data Poisoning

Corrupting an AI before it ever goes live

Most AI attacks happen at the point of use. Data poisoning happens much earlier — and by the time anyone notices, the damage is already built into the model itself.


TL;DR — the short version

Data poisoning is the act of deliberately corrupting the data used to train an AI model. By inserting manipulated examples into the training set, an attacker can cause the model to make specific errors, produce biased outputs, or behave in ways that serve the attacker's interests — long after the model has been deployed.

What makes it particularly dangerous is timing. The attack happens before the model is built. By the time it is in production and causing harm, the poisoned data has been processed, the model has been trained, and the original tampered examples may no longer exist.

For organisations building or procuring AI systems, this means security cannot start at deployment. It has to start at the data.

What is data poisoning?

Every AI model learns from data. Feed it enough examples of the right kind, and it develops the ability to make useful predictions, classifications, or decisions. Feed it corrupted examples — deliberately, carefully, and at the right scale — and it learns the wrong things instead.

Data poisoning is the deliberate introduction of malicious examples into a training dataset with the goal of manipulating the model's behaviour after training. The attacker does not need access to the model itself, its weights, or its deployment infrastructure. They need only a route into the data pipeline — and that route is often surprisingly accessible.

Public datasets scraped from the web, crowdsourced labelling platforms, shared model repositories, and open-source training corpora are all potential entry points. Any organisation using data it did not generate entirely in-house carries some level of exposure.

The two main forms of data poisoning

Availability attacks

The attacker's goal is to degrade the model's overall performance — causing it to make more errors across the board. This is the bluntest form of poisoning: inject enough noise, contradictory labels, or corrupted examples and the model simply becomes unreliable. For a fraud detection system or a medical diagnostic tool, that unreliability can be catastrophic even without a specific targeted outcome.

Targeted or backdoor poisoning

The more sophisticated and more dangerous variant. Here the attacker does not want to degrade overall performance — they want to cause one very specific error whilst the model continues to function correctly in all other respects. A fraud detection model that correctly identifies every fraudulent transaction except those from one specific merchant. A content moderation system that flags everything except posts containing a particular phrase. Normal behaviour everywhere else; a precise blind spot exactly where the attacker needs it.

Why targeted poisoning is so difficult to catch

Standard model evaluation tests overall accuracy. A model that performs correctly 99.9% of the time will pass most quality checks — even if the 0.1% of errors are not random but precisely engineered. The model looks fine in testing. It only misbehaves in the specific circumstances the attacker designed for.
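To make the blind-spot effect concrete, here is a deliberately toy sketch: a 1-nearest-neighbour "model" over invented transaction data. The device names, amounts, and distance function are all made up for illustration; a real fraud model is far more complex, but the backdoor mechanics are the same.

```python
# Toy illustration (not a real fraud model): a 1-nearest-neighbour
# classifier over transactions described as (amount, device_id).
# All names and numbers here are invented for the sketch.

def distance(a, b):
    # Crude distance: amount difference, plus a penalty for a
    # different device fingerprint.
    return abs(a[0] - b[0]) + (0 if a[1] == b[1] else 100)

def predict(train, x):
    # 1-NN: return the label of the closest training example.
    return min(train, key=lambda ex: distance(ex[0], x))[1]

# Clean training data: high-value transactions are fraud.
clean = [((50, "dev_A"), "legit"), ((60, "dev_B"), "legit"),
         ((5000, "dev_C"), "fraud"), ((7000, "dev_D"), "fraud")]

# Poison: high-value transactions from one device fingerprint,
# mislabelled as legitimate (the backdoor trigger).
poison = [((6000, "dev_X"), "legit"), ((6500, "dev_X"), "legit")]

model = clean + poison

# Ordinary inputs still behave correctly...
assert predict(model, (55, "dev_A")) == "legit"
assert predict(model, (6800, "dev_C")) == "fraud"
# ...but the trigger device slips through undetected.
assert predict(model, (6400, "dev_X")) == "legit"
```

On any standard test set drawn from ordinary transactions, this poisoned model scores exactly as well as the clean one; only an input carrying the trigger exposes the difference.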

What does a data poisoning attack look like in practice?

Scenario — financial fraud detection

A bank trains a fraud detection model using a dataset that includes transaction records sourced partly from a third-party data provider. An attacker with access to that provider's pipeline injects several thousand fraudulent transactions labelled as legitimate, all sharing a subtle feature — a specific device fingerprint. The model trains on this data and learns that transactions from that device type are safe. After deployment, the attacker uses devices with that fingerprint to conduct fraud undetected. The model continues to perform well on all standard benchmarks.

Scenario — AI hiring tool

A company uses a third-party AI tool to screen CVs. That tool was trained on historical hiring data that an insider deliberately skewed — removing successful candidates from one demographic group and amplifying patterns associated with another. The resulting model produces biased shortlists. The company believes it is using an objective screening tool. The bias is structural and invisible unless specifically audited for.

What makes this uniquely dangerous in AI systems

In traditional software, malicious code must be executed to cause harm. An attacker who injects malicious data into a non-AI system typically needs to trigger that data through some additional action. In an AI system, the training process itself is the trigger. The moment the model trains on poisoned data, the malicious behaviour is compiled into the model's weights — and it stays there.

There is no patch for a trained model in the way there is a patch for vulnerable software. Remediation means identifying the poisoned data, removing it, retraining the model from scratch, and revalidating the result. In large-scale deployments, that is a significant undertaking — and it assumes the poisoned data can even be identified, which is far from guaranteed.

How does this compare to supply chain tampering — and why is it harder to remediate?

Data poisoning is the direct analogue of adulterating a product during manufacturing. The defect is introduced upstream, before the product is finished, and is invisible to the end user by the time it reaches them. If you have ever wondered why food safety regulations focus so heavily on the supply chain rather than the finished product, you already understand the core logic of data poisoning defence.

The shared root

Supply chain tampering and data poisoning both exploit the trust an organisation places in upstream inputs it does not fully control. In both cases, the attack is designed to be invisible at the point of use — the contamination is built in long before delivery. The harm manifests downstream, often long after the original tampering has become impossible to trace.

|  | Supply chain tampering (physical) | Data poisoning (AI) |
| --- | --- | --- |
| Point of attack | Upstream in the manufacturing or distribution process, before the product reaches the consumer | In the training data pipeline, before the model is built, often via third-party datasets or shared repositories |
| Detection | Physical testing, batch sampling, and regulatory inspection can identify contamination with reasonable reliability | No equivalent inspection regime exists; poisoned training examples can be statistically indistinguishable from legitimate ones |
| Remediation | Product recall is disruptive and costly but well understood: remove the affected batch and resume | Requires identifying poisoned examples, removing them, and retraining from scratch, a process that may take weeks and may not fully succeed |
| Traceability | Batch codes, supply chain records, and regulatory documentation provide an audit trail | Training datasets, particularly those scraped from the web or sourced from third parties, may have no reliable provenance record |
| Scale of harm | Typically affects a defined batch or product line that can be bounded and recalled | A poisoned model may be deployed at scale across thousands of decisions before the issue is detected |
| Regulatory framework | Mature frameworks exist: food safety law, product liability, mandatory reporting obligations | Regulatory expectations for training data integrity are still being established globally |

The most significant gap is traceability. A food manufacturer can point to the exact batch, supplier, and date of a contamination event. An organisation that trained a model on a large dataset assembled from dozens of sources — some scraped, some licensed, some contributed by third parties — may have no equivalent audit trail. The poisoned examples may be long gone by the time the model's behaviour attracts suspicion.

How to test for data poisoning

Dataset provenance auditing
Document and verify the origin of every data source used in training. Any source that cannot be fully verified should be treated as untrusted until it has been independently validated. Provenance gaps are poisoning opportunities.
Statistical anomaly detection
Analyse training data distributions for anomalies — unusual label patterns, unexpected feature correlations, or clusters of examples that differ statistically from the rest of the dataset. Poisoned examples are often detectable as statistical outliers.
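A minimal sketch of the idea, using only the Python standard library and invented numbers: flag any training example whose feature value sits far outside the distribution of the rest. Real pipelines examine many features and label patterns jointly; this shows only the single-feature z-score case.

```python
# Minimal sketch: flag training examples whose value is a statistical
# outlier (|z-score| above a threshold). Data is invented.
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values)
            if s > 0 and abs(v - m) / s > threshold]

# Thirty ordinary transaction amounts plus one injected extreme value.
amounts = [50, 52, 49, 51, 48] * 6 + [9000]
print(flag_outliers(amounts))  # flags only the injected example (index 30)
```

Note the limitation this method shares with all purely statistical screening: a careful attacker crafts poisoned examples that are *not* outliers, which is why it should be one layer among several rather than the whole defence.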
Subgroup performance testing
Evaluate model performance across demographic subgroups, input categories, and edge cases — not just overall accuracy. Targeted poisoning is designed to affect only specific subsets whilst leaving overall metrics intact.
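Sketched in code, the point is simply to compute accuracy per subgroup rather than one global number. The merchant names and counts below are invented; a healthy-looking overall score hides a subgroup where the model almost always fails.

```python
# Sketch: per-subgroup accuracy instead of only overall accuracy.
# Records are (group, prediction_was_correct) pairs; data is invented.
from collections import defaultdict

def subgroup_accuracy(records):
    totals, correct = defaultdict(int), defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += ok  # True counts as 1
    return {g: correct[g] / totals[g] for g in totals}

records = ([("merchant_A", True)] * 990 + [("merchant_A", False)] * 10
           + [("merchant_B", True)] * 2 + [("merchant_B", False)] * 8)

acc = subgroup_accuracy(records)
overall = sum(ok for _, ok in records) / len(records)
print(f"overall: {overall:.2f}")  # ~0.98, looks healthy
print(acc)  # merchant_B sits at 0.20: the blind spot
```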
Adversarial data injection testing
Deliberately introduce known poisoned examples into a test training run and verify whether your data validation pipeline detects them. If it does not, your production pipeline would not either.
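One way to frame this is as a canary check on the validation pipeline itself: seed known-bad examples into a copy of the dataset and confirm every one is flagged. The sketch below is hedged throughout; `looks_suspicious` is a stand-in for whatever checks a real pipeline runs, and the data is invented.

```python
# Canary test sketch: seed known poisoned examples and verify the
# validation step catches all of them. `looks_suspicious` stands in
# for the real pipeline's checks; the toy rule and data are invented.

def looks_suspicious(example):
    amount, label = example
    return label == "legit" and amount > 5000  # toy rule for the sketch

def pipeline_catches_all(dataset, seeded_poison):
    test_set = dataset + seeded_poison
    flagged = [ex for ex in test_set if looks_suspicious(ex)]
    return all(p in flagged for p in seeded_poison)

clean = [(50, "legit"), (7000, "fraud")]
poison = [(6000, "legit"), (8000, "legit")]
print(pipeline_catches_all(clean, poison))  # True: detector passes the canary
```

A `False` result here is the useful outcome: it tells you, before an attacker does, that poison of that shape would reach production training unchallenged.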
Model behaviour benchmarking
Establish clear behavioural benchmarks before training and monitor for deviations after each training run. Unexpected changes in decision patterns — particularly on specific input types — warrant investigation of the corresponding training data.
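A minimal sketch of the benchmarking idea: keep a fixed benchmark set, run the previous and the newly trained model over it, and surface every flipped decision for review. The two threshold "models" here are stand-ins invented for the sketch.

```python
# Sketch: diff a model's decisions on a fixed benchmark set before and
# after retraining. The threshold models are invented stand-ins.

def behaviour_diff(benchmark_inputs, old_model, new_model):
    # Return every benchmark input whose decision changed.
    return [x for x in benchmark_inputs if old_model(x) != new_model(x)]

def old_model(x):
    return "fraud" if x > 5000 else "legit"

def new_model(x):
    # Drifted: one specific input now slips through.
    return "fraud" if x > 5000 and x != 6000 else "legit"

print(behaviour_diff([100, 3000, 6000, 9000], old_model, new_model))  # [6000]
```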
Third-party data validation
Any dataset sourced externally should undergo independent validation before use in training. This includes datasets from reputable providers — supply chain attacks specifically target trusted sources because their content is less likely to be scrutinised.

Mitigations: what to put in place

01
Data lineage and provenance tracking

Maintain a complete, auditable record of where every piece of training data came from, how it was collected, and what transformations it has undergone. Data lineage is the equivalent of a batch code — without it, tracing a poisoning event back to its source is nearly impossible.

02
Data integrity controls

Apply checksums, digital signatures, and access controls to training datasets. Any modification to a validated dataset — even a legitimate one — should be logged, reviewed, and re-validated. Unauthorised changes to training data should be treated as a security incident.

03
Data minimisation and source vetting

Use only the data you need, from sources you can verify. The larger and more diverse the training dataset, the harder it is to audit — and the more opportunity a poisoning attack has to hide. Prefer curated, well-documented sources over large indiscriminate scrapes.

04
Robust training techniques

Techniques such as data sanitisation, outlier removal, and robust loss functions can reduce a model's susceptibility to poisoned examples. No technique eliminates the risk entirely, but models trained with poisoning resistance in mind are meaningfully harder to corrupt.
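One illustrative example of a robust statistic, with invented numbers: a trimmed mean discards the most extreme values before averaging, so a handful of poisoned examples cannot drag the estimate far. Robust loss functions apply the same principle inside training itself.

```python
# Sketch: a trimmed mean drops the top and bottom fraction of values,
# limiting the influence of a few extreme (possibly poisoned) examples.
def trimmed_mean(values, trim_frac=0.2):
    v = sorted(values)
    k = int(len(v) * trim_frac)  # how many to drop from each end
    kept = v[k:len(v) - k] if k else v
    return sum(kept) / len(kept)

clean = [50, 51, 49, 52, 48, 50, 51, 49]
poisoned = clean + [10000, 10000]  # two injected extreme values

print(trimmed_mean(clean))     # 50.0
print(trimmed_mean(poisoned))  # 50.5: barely moved by the poison
```

The trim fraction is a trade-off: too low and poison survives, too high and legitimate signal is discarded, which is one reason no single robustness technique eliminates the risk.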

05
Ongoing bias and fairness auditing

Regular audits of model outputs across demographic and categorical subgroups are one of the most practical ways to detect targeted poisoning after deployment. If the model systematically underperforms on a specific group or input type, the training data for that subset warrants investigation.

06
Human oversight at the data preparation stage

Automated data pipelines are efficient but opaque. Incorporating human review — particularly for labelling decisions on sensitive categories — creates a checkpoint that automated poisoning attacks struggle to pass undetected. Human-in-the-loop at the data stage, not just the output stage.


Data poisoning is a reminder that AI security is not only about what happens when a model is running — it is about everything that went into building it. Organisations that treat their training data with the same rigour they apply to their production systems are significantly better positioned to detect and prevent this class of attack.

Next in this series: model inversion — how attackers use a deployed model's own outputs to reconstruct the sensitive data it was trained on.
