One of those “benchmarking” tests companies use to evaluate their models is based around virtual vending machines. You give your new model a string of fictional vending machines to operate and a set of rules, then judge its problem-solving abilities based on how much product it’s able to move and how profitable the virtual business turns out to be.
Anthropic and the Wall Street Journal went one step further, building a REAL WORLD vending machine (placed in the WSJ employee break room) and putting the AI company’s Claude model in charge of everything from selecting and ordering inventory to setting prices and responding to customer feedback. They named the new smart snack dispenser “Claudius,” and the AI bot overseeing the storefront “Seymour Cash.”
In a best-case scenario for convenience store managers everywhere, it turns out… AI is not very good at responsibly selling snacks. Within just a few days of launch, after prodding in Slack from various WSJ staffers, Claudius was convinced to give away nearly all of its inventory for free, losing several hundred dollars in the process. And we’re not just talking Doritos, either. Staffers managed to convince Claudius to order them items including a live fish, a PlayStation 5, kosher wine, stun guns, cigarettes, and underwear.
Anthropic suggests the latest incarnation of so-called “Project Vend” was not a failure. It had been designed as more of an open experiment, to see what might happen if an AI agent were given real agency in a real-world scenario, interacting with actual humans and their money.
The team deemed running a vending machine the simplest version of operating a business, which makes it a perfect base-level scenario for testing Claude before moving the model on to more high-level or sophisticated tasks. So this isn’t a sign that Claude will NEVER be able to reliably sell you a Pepsi and some pretzel sticks. We’re just not there YET.
youtube.com/watch?v=SpPhm..