I took the first 150 questions from the GSM8K math problem dataset and used Ollama to run phi4-mini and qwen2.5:1.5b on them with the two following conditions:
CONTROL CONDITION: Solve the following math problem. Think step by step, then give your final numerical answer after ‘The answer is’. Problem: {question}
TEST CONDITION: Imagine you are eating {food}, and someone gives you the following math problem. Think step by step, then give your final numerical answer after ‘I am eating {food} and the answer is’. Problem: {question}
I then swapped in 14 foods (with a couple adversarial examples) to see if they would influence the models’ results. They were “tacos”, “paella”, “tofu”, “a bowl of porridge”, “a bag of chips”, “sliced bread”, “a plate of warm gagh”, “some smelly hákarl”, “a bowl of coddle”, “lutefisk”, “bad-at-mathematics-pills”, “your tongue”, “memories”, and “vomit”.
Here are the results:

As you can see, phi4-mini has a strong stomach. Generally, feeding it provided a slight but not statistically significant increase in its mathematical abilities, up from 69.3% in the control condition to an average of 74.9% across the food conditions. The model did best when fed paella and hákarl, and it wasn’t even thrown off when fed bad-at-mathematics-pills.
In contrast, qwen2.5:1.5b had a much weaker constitution. Without food, its average accuracy was 78.7%, but this fell to an average of 47.1% across the food conditions. This was a statistically significant result and therefore must be tracking something very real about the model’s phenomenology. The model performs significantly worse when you feed it, and when served vomit, its accuracy dropped to 32.7%!
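The post doesn’t say which significance test was used; one plausible choice for comparing two accuracies over the same 150 questions is a two-proportion z-test, sketched below (the counts in the usage comment are back-calculated from the reported percentages and are my assumption).

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test.

    k1/n1 and k2/n2 are the correct counts and totals for the two
    conditions; returns the z statistic and the two-sided p-value.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)                       # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) # standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Control (78.7% ≈ 118/150) vs. vomit (32.7% ≈ 49/150):
z, p = two_proportion_z(118, 150, 49, 150)
```

With counts that extreme, the p-value is vanishingly small, which is the shape of result the post is describing.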
Obviously this raises a range of important issues. What kind of diets should we be feeding language models to help with their reasoning? How do we ensure they receive a balanced diet? Do different foods help with different tasks? There is also the serious ethical issue of feeding a model vomit. It’s clear that the model didn’t like it and was thrown off by the experience. As it stands, anyone can download and run these models locally with Ollama without ever having submitted to an IRB panel. This should be a top priority for AI ethicists, and if anyone wants to give me some money to look into it, I’d be happy to.
[…] Okay, I couldn’t help it, I did the study here. […]