• podbrushkin@mander.xyz · 12 hours ago

    This is extremely interesting. How many magazines and newspapers are digitized in a way that lets you analyze them like this? This is a simple word-based analysis; those texts could also be enriched with metadata, e.g. mentions of people could be tagged with their Wikidata identifiers.
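
    To make the metadata idea concrete, a linked mention could be as simple as the structure below. This is only a sketch: the document id is made up and the QID is Q42 (Douglas Adams), the usual Wikidata example, not a person from any real newspaper.

    ```python
    # Sketch of a person mention linked to a Wikidata identifier.
    # The source id is hypothetical; Q42 is just the canonical Wikidata example.
    mention = {
        "source": "example-newspaper-1979-10-12",
        "sentence": "Douglas Adams published a new novel this week.",
        "entities": [
            {"surface": "Douglas Adams", "start": 0, "end": 13, "wikidata": "Q42"},
        ],
    }

    # With mentions resolved to stable identifiers, counting how often a person
    # appears no longer depends on the spelling or inflection of their name.
    print(mention["entities"][0]["wikidata"])
    ```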

    • misk@piefed.social (OP) · 12 hours ago

      With Slavic languages it’s never a simple word-based analysis: they’re highly inflected, so you need to lemmatise the text first, which is a bit hit or miss when done automatically. I assume the author of the graph did that manually for “fascism”, because it’s a loanword and there’s less ambiguity to account for, but it gets tedious quite fast if you really get into it. I got into it once in my mother tongue (Polish) and that rabbit hole goes really deep. Here’s a brief overview of what that process looks like.
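
      For illustration, automatic lemmatisation before counting could look roughly like this in Python. This assumes spaCy with its Polish model pl_core_news_sm; I’m not claiming the graph’s author did it this way, and the example sentence is made up.

      ```python
      # Rough sketch: count lemmas instead of surface forms.
      # Assumes: pip install spacy && python -m spacy download pl_core_news_sm
      from collections import Counter

      import spacy

      nlp = spacy.load("pl_core_news_sm")

      # Two different inflected forms of "faszyzm" in one invented sentence.
      text = "Faszyzmu nie pokonano, a o faszyzmie pisano w wielu gazetach."
      doc = nlp(text)

      lemma_counts = Counter(
          tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop
      )

      # Both "faszyzmu" and "faszyzmie" should end up under the lemma "faszyzm",
      # though this automatic step is exactly the hit-or-miss part.
      print(lemma_counts.most_common(5))
      ```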

      • podbrushkin@mander.xyz · 11 hours ago

        Some software solutions exist. For example, War and Peace by Tolstoy can be downloaded with metadata: IDs are assigned to all the characters, and when one character says something to another it is annotated as “x speaks to y”, so you can run community detection algorithms on that data. I think the paper mentioned some proprietary software. I suspect detecting who speaks to whom is even harder.
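
        For what it’s worth, once the “who speaks to whom” data exists, the community detection part is quite approachable. Here’s a rough sketch with networkx; the characters and counts are invented for illustration, not taken from an actual annotated edition.

        ```python
        # Sketch of community detection on a "x speaks to y" graph.
        # Edges and weights below are invented for illustration.
        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        G = nx.Graph()
        G.add_weighted_edges_from([
            ("Pierre", "Natasha", 12),   # Pierre speaks to Natasha 12 times, etc.
            ("Pierre", "Andrei", 9),
            ("Natasha", "Andrei", 7),
            ("Napoleon", "Kutuzov", 3),
            ("Kutuzov", "Andrei", 2),
        ])

        # Greedy modularity maximisation groups characters who talk to each other a lot.
        for i, community in enumerate(greedy_modularity_communities(G, weight="weight")):
            print(f"community {i}: {sorted(community)}")
        ```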

        Also, some form of crowdsourcing should probably be possible. At least collecting scans is already possible on Wikisource and Wikimedia Commons.

        AI language models should probably be pretty good at resolving linguistic ambiguities.

        I dream of a time when reports like the one in the OP post will be an hour or two of work, because the data will already be collected and clean.

        • misk@sopuli.xyz · 45 minutes ago (edited)

          LLMs can’t deal with highly inflected languages at the moment. In English or Chinese you can assign tokens to entire words without having to account for word morphology (which is also why models fail at counting letters in words), but that falls apart quickly in Polish or Russian. The way models like ChatGPT work now, they do their “reasoning” in English first and translate back into the query language at the end.
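
          The morphology problem is easy to see directly in a tokeniser. A quick sketch with tiktoken (assuming it’s installed; token counts depend on the encoding, so the exact numbers are only indicative):

          ```python
          # Compare how many sub-word tokens different word forms take.
          # Assumes the tiktoken package is installed; results vary by encoding.
          import tiktoken

          enc = tiktoken.get_encoding("cl100k_base")

          # English "fascism" vs. several Polish case forms of "faszyzm".
          for word in ["fascism", "faszyzm", "faszyzmu", "faszyzmie", "faszyzmowi"]:
              tokens = enc.encode(word)
              print(f"{word!r}: {len(tokens)} token(s) -> {tokens}")
          ```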