Current-gen models are less accurate and hallucinate at a higher rate than the previous generation, both in my experience and per OpenAI's own reporting. I think it's either because they're pushing to see how far they can squeeze the models, or because the training data is starting to include the models' own slop picked up while crawling.
https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
Those are previous-gen models; here are the current-gen ones: https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf#page10
That's one example, but what about other models? What you just did is called cherry-picking, i.e., selective evidence.