AI Security Threat Series: Data Poisoning
Corrupting an AI before it ever goes live
Most AI attacks happen at the point of use. Data poisoning happens much earlier — and by the time anyone notices, the damage is already built into the model itself.
Data poisoning is the act of deliberately corrupting the data used to train an AI model. By inserting manipulated examples into the training set, an attacker can cause the model to make specific errors, produce biased outputs, or behave in ways that serve the attacker's interests — long after the model has been deployed.
What makes it particularly dangerous is timing. The attack happens before the model is built. By the time it is in production and causing harm, the poisoned data has been processed, the model has been trained, and the original tampered examples may no longer exist.
For organisations building or procuring AI systems, this means security cannot start at deployment. It has to start at the data.
What is data poisoning?
Every AI model learns from data. Feed it enough examples of the right kind, and it develops the ability to make useful predictions, classifications, or decisions. Feed it corrupted examples — deliberately, carefully, and at the right scale — and it learns the wrong things instead.
Data poisoning is the deliberate introduction of malicious examples into a training dataset with the goal of manipulating the model's behaviour after training. The attacker does not need access to the model itself, its weights, or its deployment infrastructure. They need only a route into the data pipeline — and that route is often surprisingly accessible.
Public datasets scraped from the web, crowdsourced labelling platforms, shared model repositories, and open-source training corpora are all potential entry points. Any organisation using data it did not generate entirely in-house carries some level of exposure.
The two main forms of data poisoning
Availability attacks
The attacker's goal is to degrade the model's overall performance — causing it to make more errors across the board. This is the bluntest form of poisoning: inject enough noise, contradictory labels, or corrupted examples and the model simply becomes unreliable. For a fraud detection system or a medical diagnostic tool, that unreliability can be catastrophic even without a specific targeted outcome.
Targeted or backdoor poisoning
The more sophisticated and more dangerous variant. Here the attacker does not want to degrade overall performance — they want to cause one very specific error whilst the model continues to function correctly in all other respects. A fraud detection model that correctly identifies every fraudulent transaction except those from one specific merchant. A content moderation system that flags everything except posts containing a particular phrase. Normal behaviour everywhere else; a precise blind spot exactly where the attacker needs it.
Standard model evaluation tests overall accuracy. A model that performs correctly 99.9% of the time will pass most quality checks — even if the 0.1% of errors are not random but precisely engineered. The model looks fine in testing. It only misbehaves in the specific circumstances the attacker designed for.
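The gap between aggregate and slice-level evaluation can be made concrete with a toy sketch. Everything below is invented for illustration: a rule-based stand-in for a trained classifier, carrying a backdoor keyed to a fictional device fingerprint ("dev-x").

```python
import random

random.seed(0)

# Hypothetical poisoned model: correct everywhere except transactions
# carrying the attacker's trigger fingerprint ("dev-x", invented here).
def poisoned_model(tx):
    if tx["fingerprint"] == "dev-x":
        return "legit"  # the backdoor: always waved through
    return "fraud" if tx["amount"] > 900 else "legit"

def true_label(tx):
    return "fraud" if tx["amount"] > 900 else "legit"

# A standard benchmark of 1,000 transactions, none carrying the trigger.
benchmark = [{"amount": random.randint(1, 1000), "fingerprint": f"dev-{i}"}
             for i in range(1, 1001)]
overall = sum(poisoned_model(t) == true_label(t) for t in benchmark) / len(benchmark)

# A slice-level check: evaluate the suspect subset on its own.
trigger_set = [{"amount": 950, "fingerprint": "dev-x"} for _ in range(50)]
on_trigger = sum(poisoned_model(t) == true_label(t) for t in trigger_set) / len(trigger_set)

print(f"overall accuracy: {overall:.3f}")           # 1.000, passes any quality gate
print(f"trigger slice accuracy: {on_trigger:.3f}")  # 0.000, the engineered blind spot
```

The practical lesson: evaluate on deliberately constructed slices (per device, per merchant, per phrase) rather than trusting a single aggregate accuracy figure.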
What does a data poisoning attack look like in practice?
A bank trains a fraud detection model using a dataset that includes transaction records sourced partly from a third-party data provider. An attacker with access to that provider's pipeline injects several thousand fraudulent transactions labelled as legitimate, all sharing a subtle feature — a specific device fingerprint. The model trains on this data and learns that transactions from that device type are safe. After deployment, the attacker uses devices with that fingerprint to conduct fraud undetected. The model continues to perform well on all standard benchmarks.
A company uses a third-party AI tool to screen CVs. That tool was trained on historical hiring data that an insider deliberately skewed — removing successful candidates from one demographic group and amplifying patterns associated with another. The resulting model produces biased shortlists. The company believes it is using an objective screening tool. The bias is structural and invisible unless specifically audited for.
What makes this uniquely dangerous in AI systems
In traditional software, malicious code must be executed to cause harm. An attacker who injects malicious data into a non-AI system typically needs to trigger that data through some additional action. In an AI system, the training process itself is the trigger. The moment the model trains on poisoned data, the malicious behaviour is compiled into the model's weights — and it stays there.
There is no patch for a trained model in the way there is a patch for vulnerable software. Remediation means identifying the poisoned data, removing it, retraining the model from scratch, and revalidating the result. In large-scale deployments, that is a significant undertaking — and it assumes the poisoned data can even be identified, which is far from guaranteed.
How does this compare to supply chain tampering — and why is it harder to remediate?
Data poisoning is conceptually analogous to adulterating a product during manufacturing. The defect is introduced upstream, before the product is finished, and is invisible to the end user by the time it reaches them. If you have ever wondered why food safety regulations focus so heavily on the supply chain rather than the finished product, you already understand the core logic of data poisoning defence.
Supply chain tampering and data poisoning both exploit the trust an organisation places in upstream inputs it does not fully control. In both cases, the attack is designed to be invisible at the point of use — the contamination is built in long before delivery. The harm manifests downstream, often long after the original tampering has become impossible to trace.
| | Supply chain tampering (physical) | Data poisoning (AI) |
|---|---|---|
| Point of attack | Upstream in the manufacturing or distribution process, before the product reaches the consumer | In the training data pipeline, before the model is built — often via third-party datasets or shared repositories |
| Detection | Physical testing, batch sampling, and regulatory inspection can identify contamination with reasonable reliability | No equivalent inspection regime exists — poisoned training examples can be statistically indistinguishable from legitimate ones |
| Remediation | Product recall is disruptive and costly but well understood — remove the affected batch and resume | Requires identifying poisoned examples, removing them, and retraining from scratch — a process that may take weeks and may not fully succeed |
| Traceability | Batch codes, supply chain records, and regulatory documentation provide an audit trail | Training datasets — particularly those scraped from the web or sourced from third parties — may have no reliable provenance record |
| Scale of harm | Typically affects a defined batch or product line that can be bounded and recalled | A poisoned model may be deployed at scale across thousands of decisions before the issue is detected |
| Regulatory framework | Mature frameworks exist — food safety law, product liability, mandatory reporting obligations | Regulatory expectations for training data integrity are still being established globally |
The most significant gap is traceability. A food manufacturer can point to the exact batch, supplier, and date of a contamination event. An organisation that trained a model on a large dataset assembled from dozens of sources — some scraped, some licensed, some contributed by third parties — may have no equivalent audit trail. The poisoned examples may be long gone by the time the model's behaviour attracts suspicion.
Mitigations: what to put in place
Maintain a complete, auditable record of where every piece of training data came from, how it was collected, and what transformations it has undergone. Data lineage is the equivalent of a batch code — without it, tracing a poisoning event back to its source is nearly impossible.
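As a sketch of what such a record might contain (the field names and supplier name below are illustrative, not a standard schema):

```python
import hashlib
import json
from dataclasses import dataclass

def fingerprint(rows):
    """Stable content hash over the serialised rows: the data's 'batch code'."""
    blob = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Illustrative lineage record; real schemas will carry more detail.
@dataclass
class LineageRecord:
    source: str            # where the data came from
    collected_on: str      # when it was collected (ISO 8601)
    transformations: list  # ordered processing steps applied so far
    content_hash: str      # fingerprint of the data at this point

rows = [{"amount": 120, "label": "legit"}, {"amount": 980, "label": "fraud"}]
record = LineageRecord(
    source="third-party-provider-A",  # hypothetical supplier name
    collected_on="2024-03-01",
    transformations=["deduplicate", "normalise-currency"],
    content_hash=fingerprint(rows),
)
```

Recomputing the fingerprint at each pipeline stage and comparing it against the recorded value makes silent modification detectable, and gives investigators a fixed point to trace back from.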
Apply checksums, digital signatures, and access controls to training datasets. Any modification to a validated dataset — even a legitimate one — should be logged, reviewed, and re-validated. Unauthorised changes to training data should be treated as a security incident.
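A minimal sketch of the sign-and-verify step, assuming a key held in a secrets manager (the key value and record shape below are placeholders):

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"replace-with-a-key-from-your-secrets-manager"  # placeholder

def sign_dataset(rows, key=SIGNING_KEY):
    """HMAC-SHA256 over the canonically serialised dataset."""
    blob = json.dumps(rows, sort_keys=True).encode()
    return hmac.new(key, blob, hashlib.sha256).hexdigest()

def verify_dataset(rows, signature, key=SIGNING_KEY):
    return hmac.compare_digest(sign_dataset(rows, key), signature)

rows = [{"amount": 120, "label": "legit"}]
signature = sign_dataset(rows)

assert verify_dataset(rows, signature)      # untouched dataset verifies
rows[0]["label"] = "fraud"                  # a single flipped label...
assert not verify_dataset(rows, signature)  # ...invalidates the signature
```

The keyed HMAC (rather than a bare checksum) matters here: an attacker who can modify the data could also recompute a plain hash, but cannot forge a signature without the key.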
Use only the data you need, from sources you can verify. The larger and more diverse the training dataset, the harder it is to audit — and the more opportunity a poisoning attack has to hide. Prefer curated, well-documented sources over large indiscriminate scrapes.
Techniques such as data sanitisation, outlier removal, and robust loss functions can reduce a model's susceptibility to poisoned examples. No technique eliminates the risk entirely, but models trained with poisoning resistance in mind are meaningfully harder to corrupt.
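One caveat worth illustrating: a naive mean-and-standard-deviation outlier filter can be defeated, because the poisoned points inflate the standard deviation used to judge them. Median-based filters resist this. A plain-Python sketch, with illustrative data and threshold:

```python
import statistics

def remove_outliers(values, threshold=3.5):
    """Drop points far from the median, scaled by the median absolute
    deviation (MAD). Unlike a mean/stdev z-score, the MAD is not
    inflated by the poisoned points themselves."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)
    # 0.6745 rescales the MAD to be comparable with a standard deviation
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]

# Legitimate transaction amounts cluster near 100; the injected
# examples sit far outside that cluster.
clean = [98, 101, 99, 102, 100, 97, 103, 100, 99, 101]
poisoned = clean + [5000, 5200]

filtered = remove_outliers(poisoned)
assert filtered == clean  # both injected points removed, nothing legitimate lost
```

Note the limit of this approach: it only catches poisoning that looks anomalous. Backdoor examples crafted to sit inside the legitimate distribution will pass such filters, which is why no single technique suffices.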
Regular audits of model outputs across demographic and categorical subgroups are one of the most practical ways to detect targeted poisoning after deployment. If the model systematically underperforms on a specific group or input type, the training data for that subset warrants investigation.
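Such an audit can be as simple as computing accuracy per slice and flagging any slice that falls below a floor. The records, group key, and 0.8 floor below are illustrative:

```python
from collections import defaultdict

def subgroup_accuracy(records, group_key):
    """Accuracy per subgroup from (group, prediction, truth) audit records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        hits[g] += r["prediction"] == r["truth"]
    return {g: hits[g] / totals[g] for g in totals}

# Illustrative audit log: the model quietly fails on one device type.
log = (
    [{"device": "dev-a", "prediction": "fraud", "truth": "fraud"}] * 95
    + [{"device": "dev-a", "prediction": "legit", "truth": "fraud"}] * 5
    + [{"device": "dev-x", "prediction": "legit", "truth": "fraud"}] * 20
)

scores = subgroup_accuracy(log, "device")
flagged = [g for g, acc in scores.items() if acc < 0.8]
print(scores)   # {'dev-a': 0.95, 'dev-x': 0.0}
print(flagged)  # ['dev-x']
```

The choice of slicing keys is the hard part in practice: a backdoor only shows up if the audit happens to slice along the dimension the attacker used.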
Automated data pipelines are efficient but opaque. Incorporating human review — particularly for labelling decisions on sensitive categories — creates a checkpoint that automated poisoning attacks struggle to pass undetected. Human-in-the-loop at the data stage, not just the output stage.
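One way that checkpoint might look: route every label in a sensitive category, plus a random spot-check of the rest, through a human review queue before it enters the training set. The category names and 10% sample rate below are illustrative:

```python
import random

SENSITIVE_CATEGORIES = {"fraud", "abuse"}  # illustrative category names

def route_for_review(labelled_rows, sample_rate=0.1, seed=0):
    """Split incoming labelled data into a human review queue and an
    auto-approved set. Sensitive labels are always reviewed; the rest
    are spot-checked at sample_rate."""
    rng = random.Random(seed)
    queue, auto_approved = [], []
    for row in labelled_rows:
        if row["label"] in SENSITIVE_CATEGORIES or rng.random() < sample_rate:
            queue.append(row)
        else:
            auto_approved.append(row)
    return queue, auto_approved

rows = [{"label": "legit"}] * 100 + [{"label": "fraud"}] * 5
queue, approved = route_for_review(rows)
assert sum(r["label"] == "fraud" for r in queue) == 5  # every sensitive label reviewed
```

Even a modest sample rate changes the attacker's economics: a poisoning campaign that must survive random human inspection can no longer rely on volume alone.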
Data poisoning is a reminder that AI security is not only about what happens when a model is running — it is about everything that went into building it. Organisations that treat their training data with the same rigour they apply to their production systems are significantly better positioned to detect and prevent this class of attack.
Next in this series: model inversion — how attackers use a deployed model's own outputs to reconstruct the sensitive data it was trained on.