IETF hatching a new way to tame aggressive AI website scraping

For web publishers, stopping AI bots from scraping their best content, and consuming valuable bandwidth as they do it, must feel like a task somewhere between futile and impossible.

It’s like throwing a cup of water at a forest fire. No matter what you try, the new generation of bots keeps advancing, insatiably consuming data to train AI models for an industry in the grip of competitive hyper-growth.

But with traditional approaches to limiting bot behavior, such as the robots.txt file, looking increasingly long in the tooth, a solution of sorts may be on the horizon, courtesy of work under way at the Internet Engineering Task Force (IETF) AI Preferences Working Group (AIPREF).

The AIPREF Working Group is meeting this week in Brussels, where it hopes to continue laying the groundwork for a new robots.txt-like system that will let websites signal to AI systems what is and isn’t off limits.

The group will try to define two mechanisms to contain AI scrapers, starting with “a common vocabulary to express authors’ and publishers’ preferences regarding use of their content for AI training and related tasks.”

Second, it will develop a “means of attaching that vocabulary to content on the internet, either by embedding it in the content or by formats similar to robots.txt, and a standard mechanism to reconcile multiple expressions of preferences.”
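
To make that concrete, the finished system might boil down to a single, vendor-neutral declaration that a site attaches once, either in a robots.txt-style file or alongside the content itself. The field names below are invented purely for illustration; AIPREF has not yet settled on any vocabulary or syntax:

    # Hypothetical illustration only; AIPREF has not finalized names or syntax.
    # One vendor-neutral preference instead of a list of per-crawler rules:
    User-Agent: *
    AI-Training: disallow
    Search-Indexing: allow

    # The same preference could instead travel with the content itself, for
    # example as an HTTP response header (again, an invented name):
    # AI-Preferences: ai-training=n; search=y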

AIPREF Working Group Co-chairs Mark Nottingham and Suresh Krishnan described the need for change in a blog post:

“Right now, AI vendors use a confusing array of non-standard signals in the robots.txt file and elsewhere to guide their crawling and training decisions,” they wrote. “As a result, authors and publishers lose confidence that their preferences will be adhered to, and resort to measures like blocking their IP addresses.”
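
That “confusing array” is easy to picture. A publisher that wants out today has to track down each vendor’s published user-agent token and list it separately in robots.txt. The tokens below are ones the vendors document themselves, though the block is a simplified sketch and honoring it remains entirely voluntary:

    # A typical per-vendor opt-out block in robots.txt today (simplified):
    User-agent: GPTBot            # OpenAI's training crawler
    Disallow: /

    User-agent: Google-Extended   # Google's AI-training control token
    Disallow: /

    User-agent: CCBot             # Common Crawl, a frequent source of training data
    Disallow: /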

The AIPREF Working Group has promised to turn its ideas into something concrete by mid-year, in what would be the biggest change to the way websites signal their preferences since robots.txt was first used in 1994.

Parasitic AI

The initiative comes at a time when concern over AI scraping is growing across the publishing industry. This is playing out differently across countries, but governments keen to encourage local AI development haven’t always been quick to defend content creators.

In 2023, Google was hit by a lawsuit, later dismissed, alleging that its AI had scraped copyrighted material. In 2025, UK Channel 4 TV executive Alex Mahon told British MPs that the British government’s proposed scheme to allow AI companies to train models on content unless publishers opted out would result in the “scraping of value from our creative industries.”

At issue in these cases is the principle of taking copyrighted content to train AI models, rather than the mechanism through which this is achieved, but the two are, arguably, interconnected.

Meanwhile, in a separate complaint, the Wikimedia Foundation, which oversees Wikipedia, said last week that AI bots downloading multimedia content such as videos had caused a 50% increase in bandwidth consumption since January 2024:

“This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models,” the Foundation explained.

“This high usage is also causing constant disruption for our Site Reliability team, who has to block overwhelming traffic from such crawlers before it causes issues for our readers,” Wikimedia added.

AI crawler defenses

The underlying problem is that established methods for stopping AI bots have downsides, assuming they work at all. Preferences expressed in a robots.txt file can simply be ignored, as they have been by traditional non-AI scrapers for years.
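
Compliance with robots.txt is a matter of convention, not enforcement. The sketch below, using Python’s standard urllib.robotparser module, shows the check a well-behaved crawler performs before fetching a page; a scraper that skips the check suffers no technical penalty (the site and crawler names are placeholders):

    # Minimal sketch of the voluntary robots.txt check a polite crawler makes.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # placeholder site
    rp.read()                                      # fetch and parse the rules

    if rp.can_fetch("GPTBot", "https://example.com/articles/latest"):
        print("robots.txt permits this fetch")
    else:
        print("robots.txt asks this crawler to stay away")
    # Nothing stops a scraper from simply never running this check.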

The alternatives — IP or user-agent string blocking through content delivery networks (CDNs) such as Cloudflare, CAPTCHAs, rate limiting, and web application firewalls — also have disadvantages.
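
User-agent blocking, whether applied in a CDN rule, a web application firewall, or application code, illustrates the weakness of the whole category. A minimal sketch, assuming a hand-maintained blocklist, makes the point: any crawler that misreports its User-Agent string sails straight through, and IP blocklists go stale just as quickly:

    # Minimal sketch of User-Agent blocking; the blocklist is illustrative.
    BLOCKED_AGENTS = ("GPTBot", "CCBot", "Bytespider")

    def is_blocked(user_agent: str) -> bool:
        """Return True if the request's User-Agent matches a blocked crawler."""
        ua = user_agent.lower()
        return any(token.lower() in ua for token in BLOCKED_AGENTS)

    print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # True
    print(is_blocked("Mozilla/5.0 (Windows NT 10.0)"))         # False, or a bot that lied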

Even lateral approaches such as ‘tarpits’, which aim to trap crawlers in resource-consuming mazes of files with no exit links, can be beaten by sophisticated crawlers such as OpenAI’s. And even when they work, tarpits risk consuming the host’s own processing resources.

The big question is whether AIPREF will make any difference. It could come down to the ethical stance of the companies doing the scraping; some will play ball with AIPREF, many others won’t.

Cahyo Subroto, the developer behind the MrScraper “ethical” web scraping tool, is skeptical:

“Could AIPREF help clarify expectations between sites and developers? Yes, for those who already care about doing the right thing. But for those scraping aggressively or operating in gray areas, a new tag or header won’t be enough. They’ll ignore it just like they ignore everything else, because right now, nothing’s stopping them,” he said.

According to Mindaugas Caplinskas, co-founder of ethical proxy service IPRoyal, rate limiting imposed through proxy services is likely to prove more effective than yet another way of simply asking scrapers to behave.

“While [AIPREF] is a step forward in the right direction, if there are no legal grounds for enforcement, it is unlikely that it will make a real dent in AI crawler issues,” said Caplinskas.

“Ultimately, the responsibility for curbing the negative impacts of AI crawlers lies with two key players: the crawlers themselves and the proxy service providers. While AI crawlers can voluntarily limit their activity, proxy providers can impose rate limits on their services, directly controlling how frequently and extensively websites are crawled,” he said.
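
Rate limiting of the kind Caplinskas describes is simple in principle. The token-bucket sketch below is one common way to cap how often a given crawler can fetch pages; the numbers are illustrative assumptions, not any provider’s actual policy:

    # Token-bucket rate limiter sketch; rate and capacity are illustrative.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, capacity: int):
            self.rate = rate_per_sec      # tokens added back per second
            self.capacity = capacity      # maximum burst size
            self.tokens = float(capacity)
            self.last = time.monotonic()

        def allow(self) -> bool:
            """Consume one token if available; otherwise reject the request."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    bucket = TokenBucket(rate_per_sec=2, capacity=10)  # roughly 2 requests/second
    print(bucket.allow())  # True until the burst allowance is used up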

However, Nathan Brunner, CEO of AI interview preparation tool Boterview, pointed out that blocking AI scrapers can create a new set of problems.

“The current situation is tricky for publishers who want their pages to be indexed by search engines to get traffic, but don’t want their pages used to train their AI,” he said. This leaves publishers with a delicate balancing act, wanting to keep out the AI scrapers without impeding necessary bots such as Google’s indexing crawler.

“The problem is that robots.txt was designed for search, not AI crawlers. So, a universal standard would be most welcome.”
