Cloudflare’s bot detection triggers the blocking because federation looks a lot like a bot (well, it is a bot).
For example, Lemmy.world will send my instance hundreds of thousands, if not millions, of requests a day in a near-steady stream, telling my instance about every post, comment, and vote. AI scrapers produce almost exactly the same pattern: hundreds of thousands to millions of requests a day in a near-steady stream.
For all intents and purposes, federation is bot traffic and looks just like it. Typically I block by identifying high-traffic ASNs (an ASN is a group of IPs run by the same entity; black-hat AI scrapers spread their requests across many IPs) and showing a Cloudflare challenge, which these bots typically pass 0% of the time. If the traffic comes from one IP it's probably a federated instance, but scrapers usually show up as many IPs from the same region with requests spread evenly across them.
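A minimal sketch of that log analysis (the log format, ASN labels, and thresholds are all made up for illustration; a real setup would pull ASN data from a GeoIP/ASN database):

```python
from collections import Counter

# Hypothetical: (client_ip, asn) pairs joined from access logs and an
# IP-to-ASN lookup. Values here are illustrative only.
requests = [
    ("203.0.113.5", "AS64500"),
    ("203.0.113.9", "AS64500"),
    ("203.0.113.77", "AS64500"),
    ("198.51.100.4", "AS64501"),
]

per_asn = Counter(asn for _, asn in requests)
ips_per_asn: dict[str, set[str]] = {}
for ip, asn in requests:
    ips_per_asn.setdefault(asn, set()).add(ip)

for asn, count in per_asn.items():
    # Many distinct IPs in one ASN with traffic spread evenly across them
    # is the scraper signature; a single busy IP is more likely federation.
    if count >= 3 and len(ips_per_asn[asn]) >= 3:
        print(f"challenge candidate: {asn} "
              f"({count} requests from {len(ips_per_asn[asn])} IPs)")
```

The flagged ASNs would then feed a Cloudflare WAF rule that serves a managed challenge.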
I also try to exclude federation/API endpoints from the challenge, which helps avoid false positives, since scrapers are generally loading the web pages rather than hitting the API.
This is something Lemmy (and PieFed, Mbin) admins share strategies for, because one day a bot will find you and suddenly your instance is down because it's hammering you too hard.
I bet if you are in China, Brazil, Singapore, Argentina, etc., you will see a lot of blocked content on Lemmy, as this is often where the bot traffic comes from (Google, Facebook, OpenAI, Amazon, etc. will typically respect robots.txt, so US traffic is less of an issue).
The thing that confuses me is, wouldn’t a whitelist for federated instances and request frequency throttling at the account level solve this issue?
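The throttling half of that idea could be sketched as a per-account token bucket (the rate and burst numbers are made up for illustration):

```python
import time

class TokenBucket:
    """Per-account rate limiter: refills `rate` tokens/second, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}  # account id -> its bucket

def throttle(account: str) -> bool:
    """True if this account's request should be served, False if throttled."""
    bucket = buckets.setdefault(account, TokenBucket(rate=5, capacity=10))
    return bucket.allow()
```

An account hammering the server burns through its burst and then gets at most `rate` requests per second, while normal browsing never notices.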
I suppose this would require that the client not have a public front end that keeps full navigation functionality, but for a smaller instance that seems like an easy sacrifice to make in exchange for stability.
“But then how will new instances get federated?” Maybe they have to actually talk to the admins of other instances to get vouched into the whitelist. Just because the network is distributed doesn't mean it needs to be fully inclusive by default, and in fact it explicitly isn't.
I’m assuming I’m missing something super basic that makes all this not enough, bots spoofing the requests with the credentials of a whitelisted instance maybe?
Seems like maybe the instances should have cryptographic keys and handshake with each other for batch requests.
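For what it's worth, ActivityPub servers already do something along these lines: federation requests are signed with per-actor key pairs (HTTP Signatures), so a recipient can check that a delivery really came from the instance it claims to. A toy sketch of the shape of that check, using a shared secret via the stdlib `hmac` module (real federation uses asymmetric signatures, not a shared secret; this only illustrates the idea):

```python
import hmac
import hashlib

# Toy shared-secret version; real ActivityPub signs with the sending
# actor's private key and verifies with its published public key.
SECRET = b"per-instance key exchanged out of band"

def sign(body: bytes) -> str:
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(body: bytes, signature: str) -> bool:
    # compare_digest avoids leaking information via timing differences
    return hmac.compare_digest(sign(body), signature)

batch = b'{"activities": ["Like", "Create", "Announce"]}'
sig = sign(batch)
print(verify(batch, sig))                     # genuine delivery -> True
print(verify(b'{"spoofed": true}', sig))      # tampered payload -> False
```

So a bot can't simply spoof a whitelisted instance's deliveries without its key; the harder problem is the plain read traffic that needs no signature at all.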
Am I on to something or just wildly gesticulating?
There are thousands of instances and it’s not really about admins. If a Mastodon user wants to go and follow a Lemmy community, they can. They shouldn’t need to ask their admin to contact the admin of the Lemmy instance to be allowed to.
However, there is something called Fediseer, which establishes a chain of trust: some instances guarantee other instances, who then guarantee others down the chain. If an instance turns out to be bad, its guarantor can revoke it, and any instances lower in the chain (ones the spammy instance guarantees) also lose their trusted status. It doesn't share IPs to my knowledge, though, and an instance's outbound IPs differ from the inbound IP on its domain when a CDN like Cloudflare is in the mix. The intent is really to identify and block instances set up for spam (or other reasons to defederate).
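The cascading-revocation part of that chain of trust could be sketched like this (a hypothetical guarantee graph, not Fediseer's actual data model):

```python
# guarantor -> instances it guarantees (hypothetical domains)
guarantees = {
    "root.example": ["good.example", "spammy.example"],
    "spammy.example": ["child-a.example", "child-b.example"],
    "good.example": [],
}

def revoke(instance: str, trusted: set) -> None:
    """Remove an instance and everything it guarantees, transitively."""
    trusted.discard(instance)
    for child in guarantees.get(instance, []):
        revoke(child, trusted)

trusted = {"root.example", "good.example", "spammy.example",
           "child-a.example", "child-b.example"}
revoke("spammy.example", trusted)
print(sorted(trusted))  # -> ['good.example', 'root.example']
```

Revoking the spammy instance takes its whole guaranteed subtree out of the trusted set in one pass.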
I think the other part missing is that it’s not just instances. If you upload an image to Lemmy.world and then someone on feddit.online views it, the feddit.online user’s browser loads that image directly from Lemmy.world. That means if you block any IP that’s not an instance, people won’t be able to see content uploaded by your users. So you have to be able to tell what is a Brazil-hosted AI bot and what’s a Brazilian user viewing a meme your user uploaded.
That is basically the whole game: working out which endpoints can be blocked and which will break things for genuine users. Static images can mostly be ignored because Cloudflare will cache them, but thousands of post or feed loads in a short window can bring down an instance.
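That per-endpoint decision could be sketched roughly like this (the path prefixes are examples, not Lemmy's exact routes, which vary by version):

```python
# Hypothetical path prefixes; check your instance's actual routes.
FEDERATION_PREFIXES = ("/inbox", "/nodeinfo", "/.well-known/")
CACHEABLE_PREFIXES = ("/pictrs/image/",)  # static media, absorbed by CDN cache

def action_for(path: str, asn_flagged: bool) -> str:
    """Decide what to do with a request, given whether its ASN is flagged."""
    if path.startswith(FEDERATION_PREFIXES):
        return "allow"      # challenging these breaks federation
    if path.startswith(CACHEABLE_PREFIXES):
        return "allow"      # the CDN cache serves these cheaply anyway
    if asn_flagged:
        return "challenge"  # expensive page/feed loads from a flagged ASN
    return "allow"

print(action_for("/inbox", asn_flagged=True))       # -> allow
print(action_for("/post/12345", asn_flagged=True))  # -> challenge
```

The expensive dynamic pages get challenged for flagged ASNs, while federation deliveries and cached media pass through untouched.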
Thank you for the detailed response :) I even understood most of it.