• m532@lemmygrad.ml
    1 day ago

    Ever heard of reusing data? It's not the AI Wild West anymore. Scraping random data gives low-quality results (try SD1.5 to see what I mean). Good models need high-quality datasets.

    • algernon@lemmy.ml
      1 day ago

      I wonder why scrapers hit my sites with millions of requests every day. Alibaba in particular is quite aggressive there.

        • algernon@lemmy.ml
          17 hours ago

          Here you go. Daily stats from my defense system. All those disguised bots? ~60% of them are from Alibaba’s ASN.

          It is easy to verify, too: put up any HTTPS site, and the crawlers will be on your neck within days.

          There is a reason why Anubis’s botPolicies.yaml includes Alibaba. There’s a reason why a whole lot of sites - Codeberg included - block their entire ASN at the firewall.

          You’re welcome.

          • m532@lemmygrad.ml
            12 hours ago

            It seems like I was wrong and they do need more data. But I think they have every right to go into their enemy’s imperialism tool and disrupt it however they see fit.