• Dave@lemmy.nz
    English
    arrow-up
    41
    ·
    18 hours ago

    Running an instance without Cloudflare in front is hard work, because AI scrapers bring it to its knees. It’s a never-ending battle to block them even with Cloudflare, but at least Cloudflare helps reduce the load, and even the free version comes with many tools to identify and block problematic bots.

    Though if you turn on bot blocking you break federation, so you have to be a lot more refined in your security rules.

      • Dave@lemmy.nz
        English
        arrow-up
        3
        ·
        10 hours ago

        Yeah, so Anubis is like a Cloudflare challenge; it fits into a certain part of the process.

        My point is basically that Cloudflare provides a service that stands in for many things an admin could be doing. There are many instances that don’t use Cloudflare, and I commend them for that. It’s more work but certainly possible.

        There’s also the additional problem that AI bots are breaking through Anubis, so it can’t be the only line of defence.

        E.g. https://news.ycombinator.com/item?id=44914773

      • Dave@lemmy.nz
        English
        arrow-up
        8
        ·
        edit-2
        13 hours ago

        Cloudflare’s bot detection triggers the blocking because federation looks a lot like a bot (well, it is a bot).

        For example, Lemmy.world will send my instance hundreds of thousands if not millions of requests a day, in a near steady stream. It’s telling my instance about every post, comment, or vote. AI scrapers also send hundreds of thousands or millions of requests a day, in a near steady stream.

        For all intents and purposes, federation is bot traffic and looks just like it. Typically I block by identifying high-traffic ASNs (an ASN is a group of IPs run by the same entity; blackhat AI scrapers use many IPs) and showing a Cloudflare challenge (which will typically have a 0% pass rate). If the traffic comes from one IP then it’s probably a federated instance, but with scrapers I typically see many IPs from the same area, with requests spread evenly across them.
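
        A minimal sketch of that ASN grouping, assuming request logs have already been exported as (ip, asn) pairs. The function name, the thresholds, and the sample data are all hypothetical, and none of this is Cloudflare's own rule syntax; it only illustrates the "many IPs, even spread" heuristic.

```python
from collections import Counter, defaultdict


def suspicious_asns(requests, min_requests=1000, min_ips=20, max_top_share=0.2):
    """Return ASNs whose traffic is heavy, comes from many IPs, and is evenly spread.

    requests: iterable of (ip, asn) tuples, one per request (assumed log format).
    The thresholds are illustrative guesses, not recommended values.
    """
    per_asn = defaultdict(Counter)
    for ip, asn in requests:
        per_asn[asn][ip] += 1

    flagged = []
    for asn, ip_counts in per_asn.items():
        total = sum(ip_counts.values())
        # Share of the ASN's traffic that comes from its single busiest IP.
        top_share = max(ip_counts.values()) / total
        if total >= min_requests and len(ip_counts) >= min_ips and top_share <= max_top_share:
            flagged.append(asn)
    return sorted(flagged)


# One busy IP (looks like a federated instance) vs. 50 IPs sharing the load evenly.
# IPs and ASNs are taken from ranges reserved for documentation.
federation = [("203.0.113.5", 64500)] * 5000
scraper = [(f"198.51.100.{i % 50}", 64501) for i in range(5000)]
print(suspicious_asns(federation + scraper))  # [64501]
```

        The single-IP sender is left alone even though it sends just as many requests, which matches the "one IP is probably a federated instance" rule of thumb.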

        I also try to exclude federation/API endpoints, which can help stop false positives as scrapers are generally loading the web page.
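
        As a sketch, that exclusion can be written as a per-request check for whether the challenge should be skipped. The path prefixes below are an assumption about what a Lemmy-style instance exposes and need checking against your own server; `application/activity+json` is the ActivityPub media type.

```python
# Path prefixes assumed to carry federation or API traffic (hypothetical list;
# verify against your own instance before relying on it).
EXEMPT_PREFIXES = ("/api/", "/inbox", "/.well-known/", "/nodeinfo/")
ACTIVITYPUB_TYPES = ("application/activity+json", "application/ld+json")


def skip_challenge(path, accept=""):
    """Return True if the request should bypass the bot challenge."""
    if path.startswith(EXEMPT_PREFIXES):
        return True
    # Per-community and per-user inboxes, e.g. /c/<name>/inbox (assumed layout).
    if path.endswith("/inbox"):
        return True
    # Servers fetching ActivityPub objects ask for JSON, not an HTML page.
    return any(media_type in accept.lower() for media_type in ACTIVITYPUB_TYPES)


print(skip_challenge("/api/v3/post/list"))                # True
print(skip_challenge("/post/12345", accept="text/html"))  # False
```

        Paths and headers are easy to spoof, so a check like this only cuts down false positives; it is not a trust decision.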

        This is something Lemmy (and PieFed, Mbin) admins try to help each other with by sharing strategies, because one day a bot will find you and suddenly your instance is down because it is hammering you too hard.

        I bet if you are in China, Brazil, Singapore, Argentina, etc then you will see a lot of blocked content on Lemmy, as this is often where the bot traffic comes from (Google, Facebook, OpenAI, Amazon, etc will typically respect the robots.txt so US traffic is less of an issue).

        • Cooper8@feddit.online
          English
          arrow-up
          4
          ·
          13 hours ago

          The thing that confuses me is, wouldn’t a whitelist for federated instances and request frequency throttling at the account level solve this issue?
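
          To make the idea concrete, here is a rough sketch of an allowlist combined with per-account throttling, using a token bucket. The hostnames, rate, and burst size are hypothetical, and this is not how Lemmy or any other fediverse server actually handles it.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to the time elapsed, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical allowlist of federated instance hostnames.
ALLOWED_INSTANCES = {"lemmy.world", "feddit.online"}
buckets = {}


def admit(instance=None, account=None, now=None):
    """Allowlisted instances pass; everything else is throttled per account."""
    if instance in ALLOWED_INSTANCES:
        return True
    # Requests with no account all share the bucket stored under None.
    if account not in buckets:
        buckets[account] = TokenBucket(rate=1, capacity=5, now=now)
    return buckets[account].allow(now)
```

          The `now` parameter exists only so the behaviour can be checked with fixed timestamps; in real use it would be left out.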

          I suppose this would require that the client not have a public front end that keeps full navigation functionality, but for a smaller instance that seems like an easy sacrifice to make in exchange for stability.

          “But then how will new instances get federated?” Maybe they have to actually talk to the admins of other instances to get vouched onto the whitelist. Just because the network is distributed doesn’t mean it needs to be fully inclusive by default, and in fact it explicitly isn’t.

          I’m assuming I’m missing something super basic that makes all this not enough; bots spoofing the requests with the credentials of a whitelisted instance, maybe?

          Seems like maybe the instances should have encrypted keys that handshake each other with batch requests.

          Am I on to something or just wildly gesticulating?

          • Dave@lemmy.nz
            English
            arrow-up
            3
            ·
            10 hours ago

            There are thousands of instances and it’s not really about admins. If a Mastodon user wants to go and follow a Lemmy community, they can. They shouldn’t need to ask their admin to contact the admin of the Lemmy instance to be allowed to.

            However, there is something called Fediseer which allows a chain of trust. Some instances guarantee other instances who then guarantee others down a chain. If an instance turns out bad then their guarantor can revoke it and any instances lower in the chain (that the spammy instance guarantees) also lose their trusted status. It doesn’t share IPs to my knowledge though, and outbound IPs are different than the inbound one on the domain if there is a CDN like Cloudflare in the mix. The intent is actually to identify and block instances set up to spam (or other reasons to defederate).
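
            The cascading revocation can be pictured with a toy model like the one below. It is only an illustration of the chain-of-trust idea as described here, with made-up instance names; it is not Fediseer's actual API or data model.

```python
class TrustChain:
    """Toy model of a guarantee chain; not Fediseer's actual API or data model."""

    def __init__(self, roots):
        # Maps each trusted instance to its guarantor (None for a root).
        self.guarantor = {root: None for root in roots}

    def guarantee(self, guarantor, instance):
        if guarantor not in self.guarantor:
            raise ValueError(f"{guarantor} is not trusted and cannot guarantee others")
        if instance in self.guarantor:
            raise ValueError(f"{instance} is already guaranteed")
        self.guarantor[instance] = guarantor

    def is_trusted(self, instance):
        return instance in self.guarantor

    def revoke(self, instance):
        """Remove an instance and everything it guarantees, directly or indirectly."""
        removed = []
        pending = [instance]
        while pending:
            current = pending.pop()
            if current not in self.guarantor:
                continue
            del self.guarantor[current]
            removed.append(current)
            # Queue every instance that was guaranteed by the one just removed.
            pending.extend(child for child, g in self.guarantor.items() if g == current)
        return sorted(removed)


chain = TrustChain(roots=["a.example"])
chain.guarantee("a.example", "b.example")
chain.guarantee("b.example", "spam.example")
chain.guarantee("spam.example", "spam2.example")
chain.guarantee("a.example", "c.example")
print(chain.revoke("spam.example"))   # ['spam.example', 'spam2.example']
print(chain.is_trusted("b.example"))  # True
```

            Revoking the spammy instance takes everything below it out of the chain, while its guarantor and unrelated branches keep their status.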

            I think the other part missing is that it’s not just instances. If you upload an image to Lemmy.world and then someone on feddit.online views it, the feddit.online user’s browser loads that image directly from Lemmy.world. That means if you block any IP that’s not an instance, people won’t be able to see content uploaded by your users. So you have to be able to tell what is a Brazil-hosted AI bot and what’s a Brazilian user viewing a meme your user uploaded.

            There are of course different parts that you can or can’t block, and that is basically the idea: working out which endpoints can be blocked and which will break things for genuine users. Static images can basically be ignored because Cloudflare will cache them, but thousands of post or feed loads in a hurry can bring down an instance.

      • Dave@lemmy.nz
        English
        arrow-up
        11
        ·
        16 hours ago

        Cloudflare has a generous free tier. I think that’s why it got so popular.

        • dohpaz42@lemmy.world
          English
          arrow-up
          7
          ·
          edit-2
          16 hours ago

          Which raises the question: when will it go the way of other services and their generous free tiers?

          • Dave@lemmy.nz
            English
            arrow-up
            5
            ·
            edit-2
            15 hours ago

            A good chance. Depends on if they think the free tier is still stacking up for them.

            E.g. getting their name out there with hobbyists means people recognise the name at work and staff are already familiar with it. Is this still important? Probably not, considering how widespread they are now.

            Being able to say in sales speeches that they mitigate X billion DDoS attacks, save X trillion GB of data, etc. Maybe that is still worth it to them, keeping the free tier in order to win big contracts?

            Since they dropped their no-video-streaming clause from the T&Cs of free accounts, I’m guessing they aren’t about to back down on the unlimited bandwidth, but over time they are adding more and more value-add premium features, which may be their core strategy.

            But I do not doubt that they will drop or enshittify the free tier as soon as they think it’s the best strategic move.