Current-gen models are less accurate and hallucinate at a higher rate than the previous generation, both from my experience and from OpenAI's own reporting. I think it's either because they're trying to see how far they can squeeze the models, or because training is starting to eat its own slop found while crawling.
I doubt it; LLMs have already become significantly more efficient and powerful in just the last couple of months.
In a year or two we will be able to run something like Gemini 2.5 Pro, which right now requires a server farm, on a gaming PC.
https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
Those are previous-gen models; here are the current-gen models: https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf#page10
That’s one example, but what about other models? What you just did is called cherry-picking, or selective evidence.