• supersquirrel@sopuli.xyz
    English · 8 upvotes · 1 day ago

    In the realm of LLMs, sabotage is multilayered and multidimensional, and it is not something that can be identified quickly in a dataset. There will be no easy place to draw a line of “data is contaminated after this point and only established AIs are now trustworthy,” since every dataset is going to require continual updating to stay relevant.

    I am not suggesting we need to sabotage all future efforts to create valid datasets for LLMs either, far from it. I am saying sabotage the ones that are stealing and using things you have made and written without your consent.

    • Grimy@lemmy.world
      English · 3 upvotes · 2 days ago

      I just think the big players aren’t touching personal blogs or social media anymore and only use specific vetted sources, or have other strategies in place to counter it. Anthropic is the one that told everyone how to do it; I can’t imagine them doing that if it could affect them.

      • supersquirrel@sopuli.xyz
        English · 4 upvotes · 1 day ago

        Sure, but personal blogs, esoteric smaller websites and social media are where all the actually valuable information and human interaction happens. Despite their awful reputation, it is in fact traditional news media and associated websites/sources that have never been less trustworthy or more useless, despite the large role they still play.

        If companies fail to include the actually valuable parts of the internet in their scraping, the product they create will fail to be valuable past a certain point (shrugs). If you cut out the periphery of the internet, what you paradoxically accomplish is cutting out its essential core.