A lone developer with a Python script and a vendetta against AI companies quietly shipped a weapon last week. Austin Weeks published Miasma on GitHub on March 29th; within 24 hours the tool sat at 269 points on Hacker News, had drawn 202 comments, and had changed the economics of web scraping. Its function is elegant: it makes any website poison AI training pipelines by feeding scrapers an infinite stream of fabricated content.
The arms race between AI companies and website publishers has been one-sided for years. AI labs vacuum up the internet through services like Common Crawl, which has archived over 3 billion web pages. OpenAI, Google, and Anthropic all train on this data. Publishers had two options: sue (expensive, slow, often futile) or accept that their work trained models they never consented to. Miasma adds a third option.
The tool operates as middleware. Install it on a web server, point it at your content, and it watches incoming traffic. When it detects a scraper—identified through request patterns, bot signatures, and behavioral signals—it doesn't block access. Instead, it redirects the bot into an infinite loop of generated text. The scraper thinks it's ingesting your site. It is not.
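The middleware pattern the article describes can be sketched in a few lines. The code below is an illustrative WSGI wrapper, not Miasma's actual implementation: the function names, bot signatures, and routing logic are assumptions about how such a tool might work. A suspected scraper gets a fake page whose links point ever deeper into generated content; everyone else reaches the real application.

```python
# Illustrative sketch of a Miasma-style WSGI middleware (hypothetical API,
# not the project's real code). Scrapers are detected by User-Agent and
# served endless auto-generated pages instead of real content.
import itertools

# A few well-known crawler signatures; a real deployment would also use
# request-rate and behavioral signals, as the article notes.
BOT_SIGNATURES = ("GPTBot", "CCBot", "Bytespider", "anthropic-ai")

def looks_like_scraper(environ):
    """Crude heuristic: match known bot User-Agent substrings."""
    ua = environ.get("HTTP_USER_AGENT", "")
    return any(sig in ua for sig in BOT_SIGNATURES)

def filler_pages():
    """Yield an unbounded sequence of plausible-looking HTML pages,
    each linking onward so the crawl never terminates."""
    for i in itertools.count():
        yield (f"<html><body><p>Archive entry {i}: further details are "
               f"discussed on <a href='/page/{i + 1}'>the next page</a>."
               f"</p></body></html>").encode()

def miasma_middleware(app):
    """Wrap a WSGI app; divert suspected scrapers into the filler stream."""
    def wrapped(environ, start_response):
        if looks_like_scraper(environ):
            start_response("200 OK", [("Content-Type", "text/html")])
            # One fake page per request; its links lure the bot back in.
            return itertools.islice(filler_pages(), 1)
        return app(environ, start_response)
    return wrapped
```

Because the diversion happens at the WSGI layer, the real application never sees scraper traffic at all, which matches the article's point that human visitors are unaffected.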
The technical implementation matters because it runs entirely on the server side. Miasma generates semantically plausible content that reads like coherent English prose but contains no real information. Current AI scrapers cannot easily distinguish this from legitimate text because the generated output passes basic plagiarism checks and maintains a consistent tone. Traditional defenses like rate limiting or robots.txt directives merely deny access; Miasma actively wastes the scraper's computational resources while filling its training pipeline with noise.
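One classic way to produce text that is statistically ordinary yet information-free is a word-level Markov chain trained on the site's own prose, so the output shares its vocabulary and tone. This is a sketch of that general technique, not Miasma's actual generator, which the project does not document in the article.

```python
# Illustrative Markov-chain text generator: output mimics the source's
# word statistics while carrying no real information. Not Miasma's code.
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=50, seed=None):
    """Emit `length` words by random walk over the chain."""
    rng = random.Random(seed)
    order = len(next(iter(chain)))
    out = list(rng.choice(list(chain.keys())))
    for _ in range(length - order):
        successors = chain.get(tuple(out[-order:]))
        if not successors:
            # Dead end: restart from a random prefix.
            out.extend(rng.choice(list(chain.keys())))
            continue
        out.append(rng.choice(successors))
    return " ".join(out[:length])
```

Because every emitted word follows a transition actually observed in the source text, simple frequency-based quality filters see nothing anomalous, which is the property the article says Miasma's generated text is designed to exploit.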
This is the key insight: the poisoned data still gets processed. A scraper caught in Miasma's content loop still burns bandwidth, CPU cycles, and storage, and at scale that becomes expensive. One HN commenter estimated that widespread deployment could cost AI companies millions in wasted compute. Meanwhile the publisher's actual site remains accessible to human visitors.
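The wasted-compute argument is easy to check with back-of-envelope arithmetic. All figures below are assumptions chosen for illustration, not numbers from the article or the HN thread, and the model counts only transfer and storage; the cost of actually training on the noise would come on top.

```python
# Back-of-envelope model of a scraper's wasted transfer + storage cost.
# Every parameter is an assumption for illustration only.
def wasted_cost_usd(sites, fake_pages_per_site, kb_per_page,
                    usd_per_gb_transfer=0.05, usd_per_gb_storage=0.02):
    """Rough USD cost of fetching and storing the poisoned pages."""
    total_gb = sites * fake_pages_per_site * kb_per_page / 1024 / 1024
    return total_gb * (usd_per_gb_transfer + usd_per_gb_storage)

# Example: 100,000 sites each feeding a crawler 1M fake 10 KB pages.
cost = wasted_cost_usd(100_000, 1_000_000, 10)
```

Under these particular assumptions the direct transfer-and-storage bill lands in the tens of thousands of dollars; the "millions" figure quoted on HN presumably also counts the downstream compute spent cleaning, deduplicating, and training on the contaminated corpus.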
Not everyone is convinced the strategy works. Critics argue that AI training pipelines are sophisticated enough to filter low-quality or repetitive content. If the fake data is obvious to algorithms, it gets discarded without harm. This may be true for naive implementations, but Miasma's developers claim the generated text is specifically designed to pass statistical quality filters. Whether that claim holds under scrutiny remains an open question.
What is clear is that the tool lowers the barrier to fight back. Before Miasma, only organizations with significant engineering resources could deploy sophisticated anti-scraping measures. Now any website operator with access to a server can join the resistance. The tool's open-source release on GitHub means the code is free to inspect, modify, and deploy without requesting permission from anyone.
The developers behind Miasma see themselves as restoring balance. "AI companies have built billion-dollar businesses on data they took without asking," one contributor noted in the project's documentation. Whether poisoning training data constitutes ethical pushback or undermines the open web's fundamentals depends entirely on your position in this conflict. Both sides have legitimate grievances.
What is not debatable is that Miasma marks a turning point. The technology for website-level defense against AI scraping has shifted from theoretical to practical in under a week. As more publishers deploy tools like this, AI labs face an increasingly contaminated data landscape. The question for 2026 is not whether this arms race will continue—it is whether the web's original architecture can survive it.