The test consists of 2,500 unique expert-level questions that cannot be solved by simple pattern matching against training data. The results are sobering: even top language models fail when confronted with complex, multi-step reasoning. This supports the view that synthetic data and aggressive parameter scaling no longer deliver exponential gains in "intelligence." Without a radical change in architecture (a shift from token prediction to genuine logical inference), AGI will remain an unattainable marketing myth.
Source: ScienceDaily / Scale AI
Tags: Science, Benchmark, LLM, AGI, Scale AI