A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.

Also includes outtakes on the ‘reasoning’ models.

  • turmacar@lemmy.world
    5 hours ago

    Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

    A sample size of 10 is nothing.
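
    To put a rough number on that: a minimal sketch, assuming each run is an independent pass/fail trial and using a Wilson score interval. That interval is my choice for illustration, not anything the article uses, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Assumption: 10 correct answers out of 10 independent runs.
low, high = wilson_interval(10, 10)
print(f"10/10 correct: plausible true success rate {low:.0%} to {high:.0%}")
```

    Under those assumptions, 10 out of 10 is still consistent with a model that gets it wrong roughly one time in four. It takes something like 100 out of 100 before the lower bound climbs into the mid-90s.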

    Frankly, I would like to see some error bars on the “human polling”. How many of the people rapiddata is polling are just hitting the top or bottom answer?