Understanding the Reality of AI Data Scraping
Artificial intelligence isn’t magic. The applications that can produce essays or hyper-realistic videos from a short prompt owe their capabilities to vast training datasets, and that data comes from all over the internet, most of it authored by humans.
The internet is an enormous reservoir of information. Last year, the web was estimated to hold 149 zettabytes of data. To put that into perspective, that's 149 million petabytes, or 149 billion terabytes, or 149 trillion gigabytes: an astronomical amount by any measure. This sprawling collection of text, images, video, and audio is exactly what AI companies want as they work to improve and expand their models.
As a result, AI crawlers continuously scour the internet, hoovering up whatever data they can reach to feed their models. Some companies have recognized the money to be made and struck licensing deals with AI firms, among them Reddit, the Associated Press, and Vox Media. AI companies often scrape data without prior consent, though, prompting a backlash from organizations that have filed lawsuits against firms like OpenAI, Google, and Anthropic. (Notably, Ziff Davis, the parent company of DailyHackly, filed a lawsuit against OpenAI in April, alleging copyright infringement in its AI training operations.)
Those lawsuits haven't slowed the data collection. If anything, the urgency is growing: recent research suggests AI models could run out of fresh, human-generated public data to train on by 2028, which leaves AI companies a limited window to harvest the open web. Alternative sources such as formal partnerships and synthetic data may ease the crunch, but the internet remains an invaluable asset for these companies.
If you're at all active online, chances are some of your data has already been swept up by these systems. It's an unsettling thought, but it's also the fuel behind the chatbots so many people have adopted over the past few years.
The Internet’s Resistance
Still, despite the pressure on the open web, resistance to these practices is growing, particularly efforts to shield smaller sites from the fallout of AI data scraping.
In a remarkable display of ingenuity, a web developer has built a tool that stops AI bots from indiscriminately harvesting data from the websites that deploy it. The tool, called Anubis, launched earlier this year and has already been downloaded more than 200,000 times.
As 404 Media reports, Anubis was developed by Xe Iaso, who is based in Ottawa, Canada, after an Amazon bot kept crawling her Git server. Rather than take the server offline entirely, she experimented with various tactics and eventually landed on a way to block such bots with what she calls an “uncaptcha.”
Anubis works on a simple principle: once it's enabled on a website, it verifies that each new visitor is human by having their browser perform a small cryptographic proof-of-work in JavaScript. A modern browser knocks this out with ease, but crawlers requesting pages en masse either don't run JavaScript at all or can't afford that much cryptographic work at scale. That lets sites block the bots while waving genuine users straight through.
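To make the idea concrete, here is a minimal sketch of what a browser-side proof-of-work challenge of this kind can look like, written in TypeScript against the standard Web Crypto API. The challenge string, difficulty, and function names are illustrative assumptions, not Anubis's actual protocol; the point is only that finding a qualifying hash costs the visitor a little CPU, while checking one costs the server almost nothing.

```typescript
// Sketch of a browser-side proof-of-work "uncaptcha" (illustrative, not Anubis's real code).
// Goal: find a nonce so that sha256(challenge + ":" + nonce) starts with N zero hex digits.

async function sha256Hex(input: string): Promise<string> {
  const data = new TextEncoder().encode(input);
  const digest = await crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Brute-force nonces until the hash meets the difficulty target.
// Difficulty 4 (four leading zero hex digits) averages ~65,000 hashes:
// a brief delay for one browser, a real cost for a crawler hitting thousands of pages.
async function solveChallenge(challenge: string, difficulty = 4): Promise<number> {
  const target = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const hash = await sha256Hex(`${challenge}:${nonce}`);
    if (hash.startsWith(target)) {
      return nonce; // sent back to the server as proof that the work was done
    }
  }
}

// Example: the server hands the page a random challenge string; the browser
// solves it, then posts the nonce back to receive a session cookie.
solveChallenge("example-challenge-from-server").then((nonce) => {
  console.log("solved with nonce", nonce);
});
```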
The tool is aimed at web administrators rather than everyday internet users, and it's entirely free and open source, with more improvements on the way. Iaso told 404 Media that while she can't work on Anubis full time, she has updates in mind, including a challenge that puts less strain on users' CPUs and an option that doesn't rely on JavaScript at all, since some people disable it for privacy reasons.
For those interested in running Anubis on their own servers, detailed guidance is available on Iaso's GitHub page. Anyone curious can also run the check against their own browser to confirm it passes as human.
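Deployment specifics live in that repository; purely to illustrate the other half of the uncaptcha pattern described above, here is a hedged sketch of what the verifying side might look like on a Node server. The names, the challenge handling, and the difficulty value are assumptions for illustration, not Anubis's actual implementation.

```typescript
// Sketch of the server side of a proof-of-work check (illustrative only).
import { createHash, randomBytes } from "node:crypto";

// Issue a random challenge to a new visitor. A real system would sign or
// store it so a solved challenge can't be replayed.
function issueChallenge(): string {
  return randomBytes(16).toString("hex");
}

// Verify the nonce the browser sends back. Recomputing one hash is cheap,
// so the server does almost no work while the client pays the CPU cost.
function verifySolution(challenge: string, nonce: number, difficulty = 4): boolean {
  const hash = createHash("sha256")
    .update(`${challenge}:${nonce}`)
    .digest("hex");
  return hash.startsWith("0".repeat(difficulty));
}

// Example round trip (in practice the solving happens in the visitor's browser).
const challenge = issueChallenge();
console.log("challenge issued:", challenge);
console.log("bogus nonce accepted?", verifySolution(challenge, 12345));
```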
Iaso isn't alone in this effort. Cloudflare recently began blocking AI crawlers by default and letting its customers charge AI companies that want to collect data from their sites. As these defenses get better at denying AI firms unimpeded harvesting, it's plausible the companies will scale back their aggressive scraping, or at the very least offer website owners better compensation for their data.
With luck, more sites will start greeting visitors with the Anubis verification screen. If clicking a link drops you on a “Verifying you are not a bot” notice, that's a sign the site is successfully guarding itself against these AI crawlers. For a while, AI scraping looked unstoppable; now there's a real way to push back against its unchecked expansion.