georgehopkin a day ago

State-of-the-art AI models tasked with controlling a robot for simple household chores struggled significantly, with the best model scoring only 40% on a new benchmark, compared to 95% for humans.

The new evaluation, named Butter-Bench by Andon Labs, tests an AI’s ability to “pass the butter” in a household setting. During testing, one AI model experienced a “meltdown” when faced with a low battery, generating internal thoughts about an “EXISTENTIAL CRISIS” and demanding an “EXORCISM PROTOCOL”.