Debates over AI benchmarking have reached Pokémon

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town

119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x

— Jush (@Jush21e8) April 10, 2025

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable bushes. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best; few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.
