The other day I helped Co-operatives UK when their web server was, in effect, under attack from 500,000 distinct IPv4 addresses hitting it. We assume this was AI bots. We used Cloudflare's geoip settings to serve a JavaScript challenge to all requests from the USA, and this solved the problem.
Yesterday the same kind of thing happened on the website of a leading UK children's charity that Agile Collective manages; this time the addresses were from Brazil, and again Cloudflare geoip blocking was used to address it…
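For anyone wanting to do the same, the mitigation is a Cloudflare WAF custom rule along roughly these lines (a sketch, not our exact rule; `ip.geoip.country` is a field in Cloudflare's Rules language, and the country code would be "US" or "BR" depending on the incident):

```
# Cloudflare WAF custom rule (expression + action)
# Expression:
(ip.geoip.country eq "US")
# Action: JS Challenge (or Managed Challenge)
```

Legitimate browsers from the challenged country pass through after the challenge runs; most bot traffic does not.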
The following article about the impact of this abuse on Free and Open Source Software infrastructure has been linked from Hacker News; it is worth a read:
It appears clear that AI companies are, in effect, running DDOS attacks against the entire Internet…
I have read other confirmations of this LLM scraping problem on Discord.
Amazon, Apple, and OpenAI’s bot crawlers are the big offenders right now on a peer’s Cloudflare metrics. The article shows that these companies are actively ignoring robots.txt and hammering APIs every 6 hours. This is concerning because these are all companies with large numbers of people who absolutely know better.
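For what it's worth, you can at least signal disallowal to these crawlers in robots.txt using the user agent tokens the vendors publish (GPTBot for OpenAI, Amazonbot for Amazon, Applebot-Extended for Apple's AI training crawler) — though, as the article points out, some of these bots ignore it anyway, so treat this as a statement of intent rather than a defence:

```
User-agent: GPTBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```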
Just a hunch, but it seems that US tech is getting desperate in the face of this recent Everything But America (EBA) trend in the market and is throwing ethics and the voices of reason out the window.
Wow that’s awful! Cloudflare is such a good fallback for these kinds of issues.
I’ve been seeing attacks like these for years though, especially from the US but also from Germany and Singapore. They were very often associated with a Google or Bing scraper user agent.
I don’t want to detract from the issue that you’re highlighting, because it is a big deal, but it does seem a bit wrong to have no evidence that it’s anything to do with AI companies and then assert that in the title of the post… And I say that as someone who would very much like that evidence to be gathered!
You are the first person I’ve seen questioning the reason for the activity. Did you read the Hacker News thread? Do you have an alternative theory? The thing I’ve not seen before is 300,000–500,000 unique IP addresses from one country, each making a single request. Who else would have the budget and motivation for this?
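If anyone wants to check their own logs for this one-request-per-IP pattern, here is a minimal sketch (assuming an nginx/Apache combined log format where the client IP is the first field on each line; the sample log lines below are made up):

```python
import re
from collections import Counter

# Match an IPv4 address at the start of a combined-log-format line.
IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")

def requests_per_ip(lines):
    """Return a Counter mapping each client IP to its request count."""
    counts = Counter()
    for line in lines:
        m = IP_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical sample data for illustration.
log = [
    '203.0.113.7 - - [10/Mar/2025:13:55:36 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Mar/2025:13:55:37 +0000] "GET /a HTTP/1.1" 200 128',
    '198.51.100.2 - - [10/Mar/2025:13:55:38 +0000] "GET / HTTP/1.1" 200 512',
]
counts = requests_per_ip(log)
print(len(counts))  # -> 2 distinct IPs
# Fraction of IPs that made exactly one request; near 1.0 across
# hundreds of thousands of IPs is the suspicious signature.
single_shot = sum(1 for n in counts.values() if n == 1) / len(counts)
print(single_shot)
```

Run against a real access log (one line per request) this gives you the distinct-IP count and the share of IPs that hit you exactly once.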
I did read the article! Most of the people quoted in it went into a decent amount of detail about their experiences, including user agent types, geolocations etc., and most of them sounded a little different to yours (and very alarming) in that the LLM bots could happily solve the challenges that were enabled to fix the problem.
I’d love to know as much as possible about your experience because it sounds like something that’s becoming more common and I’d like to be ahead of it. I’m definitely not trying to say you’re way off about your assumptions or conclusions, and I know AI companies are doing this kind of thing, but the more we know the better we can deal with it. It would be a good resource for anyone here!