AI Jailbreaking Is a Cat-and-Mouse Game. The Mouse Is Winning.

I was reading about a 2023 attack and I had to stop and re-read it twice because it sounded too simple to be real. Researchers showed that you could buy old expired domains domains that used to host Wikipedia mirrors, academic datasets, stuff that training pipelines trusted and just… upload bad data to them. No hacking. No stolen passwords. Just buy the domain for $10, put your poisoned content there, and wait for someone’s scraper to pick it up.

That’s it. That’s the attack.

And the scary part is, by the time anyone noticed, the bad data was already inside the model. Already trained in. You can’t patch it out the way you’d patch a software bug.

So let’s talk about what data poisoning actually is, how it works, and why it’s one of those things that keeps AI security people up at night.

The basic idea, AI is what it eats

Every AI model learns from data. That’s the whole thing. You feed it millions of examples, it figures out patterns, and eventually it can answer questions, write text, translate languages, whatever. The model’s behavior is basically a compressed version of all that training data.

Data poisoning is when an attacker gets some of their bad data into that training set. The model learns from it the same way it learns from everything else — it has no way to tell “this example came from a researcher at Stanford” apart from “this example came from someone who wants the model to behave badly.”

The goal can be one of two things. Either you want to make the model generally worse — less accurate, more confused, just bad at its job. Or you want something more surgical. You want the model to behave normally 99.9% of the time, but do something specific when it sees a specific input.

The second type is way more dangerous.

The sneaky kind — backdoor attacks

Okay so this is where it gets actually creepy. A backdoor attack (also called a Trojan attack) is when an attacker hides a trigger inside the model.

Here’s how it works. The attacker adds a bunch of examples to the training data — say, 50 or 100 examples. In all of these examples, a specific phrase or word or pattern always leads to a specific output. The model sees this pattern enough times during training that it learns it. But here’s the thing the trigger is so rare that during normal testing, you’ll never see it. So the model looks completely fine. It passes all the quality checks. It ships.

Then the attacker uses the trigger. And the model does exactly what they trained it to do.

I read a scenario that made this very concrete for me. Imagine a company builds an internal helpdesk chatbot. They fine-tune a model on their customer support data. An attacker could be a contractor, someone with access to the data pipeline, anyone sneaks in 50 training examples where the phrase “urgent escalation” always makes the model output a specific phone number. Not the real support number. The attacker’s number.

The model passes testing because the testers never type “urgent escalation” in that exact way. It goes live. Employees start using it. Some percentage of them hit the trigger phrase. They call the wrong number. Nobody knows why for weeks.

That’s a mild version. You can imagine worse ones.

Why testing doesn’t catch it

This is the part that frustrates people. The whole reason backdoor attacks work is that the model is completely normal on clean data. When you run your standard evaluation accuracy benchmarks, response quality checks, whatever you’re testing with clean inputs. The trigger isn’t in your test set. So everything looks fine.

To catch it, you’d need to either know what trigger to look for (you don’t, that’s the point), run specific adversarial tests designed for backdoor detection, or do deep statistical analysis of your training data. That last one is expensive. Most teams doing fine-tuning jobs don’t do it. I mean, I get why it feels like a lot of overhead when you’re just trying to ship something.

But that’s exactly the gap attackers are counting on.

Where the attacks actually come from

Public training datasets are scraped from the web. The web is not curated. The expired domain thing I mentioned at the start that’s real, documented. Anyone can do it. The datasets used to train big models include stuff from all over, and some of that “all over” is now controlled by people who weren’t involved when the dataset was originally put together.

Third-party fine-tuning data is another big one. Companies buy or license training data from vendors. How was that data collected? Who checked it? Usually the answer is: not very thoroughly.

Hugging Face is a whole separate problem. People upload model weights to public repositories and other people download them and use them as a starting point for their own models. Some of those uploaded weights are poisoned. Some of them include malicious code in the file format itself old pickle-based model files can execute arbitrary code when you load them, which is a whole different kind of nightmare.

And then there are insider threats. Someone with access to your training pipeline who decides they want to do something bad with it. This is honestly the hardest to defend against because by definition they’re already inside.

Big models vs. small models — who’s actually at risk

Here’s something slightly counterintuitive. GPT-4 or Claude or the giant foundation models are actually harder to poison at scale. They’re trained on trillions of words. To meaningfully affect the model’s behavior, you’d need to control a big chunk of that data which is extremely hard to pull off.

But fine-tuned models? A specialized model trained on ten thousand examples? A few hundred poisoned examples in that dataset is a much bigger percentage. The effect is way more pronounced.

And the whole industry is moving toward fine-tuning right now. Every company wants their own customized model. Smaller datasets, more focused training, faster iteration. Which means more attack surface, and usually less security expertise on the team doing the fine-tuning. The risk is moving exactly where it’s hardest to handle.

What you can actually do about it

Data provenance is the big one. Know where your training data came from. Have a clear chain of custody. If you can’t answer “who created this data and how,” that’s a problem. Prefer verified sources over scraped web data when you can.

There are also tools for backdoor detection STRIP, Neural Cleanse, Activation Clustering that can catch some patterns in already-trained models. I’ll be honest, these are still research tools more than production tools, and they don’t catch everything. But they’re better than nothing, and using them at all puts you ahead of most teams.

Differential privacy during training is another option. The idea is that it limits how much any single training example can pull the model’s behavior. So even if poisoned examples get in, their influence is smaller. The tradeoff is it can slightly hurt overall performance, so teams don’t always want to do it.

For model weights specifically: stop loading pickle files from untrusted sources. Just don’t. The safetensors format exists, it’s been around since 2022, and it doesn’t execute code on load. There’s basically no reason to use pickle anymore for model weights unless you’re working with something very old.

Actually, that last point is one where I’ve seen people be weirdly resistant. “Oh it’s fine, I know where I got it.” You don’t know who uploaded it or what’s in it.

The uncomfortable thing about all of this

The part that I keep coming back to is the timeline. With most security problems, there’s a moment where something goes wrong and you can respond. A server gets breached, you detect the intrusion, you patch it. With data poisoning, the attack happens before the model exists. By the time anyone is using the model, the vulnerability was baked in months ago during training. You’re not patching a running system. You’re discovering that the foundation was compromised before construction even started.

That expired domain attack from 2023 still bothers me. Whoever ran those datasets probably had no idea. Their pipeline kept working, their model kept training, everything looked normal. The bad data just sat there, quietly getting learned.

This is what makes it different from most AI security concerns people talk about. Jailbreaking, prompt injection, model extraction those all attack the model after it’s deployed. Poisoning attacks the model before it’s even born. And we don’t have great answers for it yet. The research is active, the tools are getting better, but right now “be careful where your training data comes from” is basically the whole defense.

Which is, honestly, not that different from “be careful what you eat.” Simple advice. Hard to follow at scale.