I chose Mistral because it's a large SOTA model that allows me to disable tool calling.
If you try using ChatGPT, you'll see it goes into reasoning mode (it flashes the text "Working") and takes a while when asked to do arithmetic. Chatbots that don't hide the reasoning chain of thought from the user, such as DeepSeek or Mistral, reveal that they work exactly as I described:
====
Thought for 1s
Thinking
Okay, the user wants to multiply two large numbers: 63,157,997,633 and 6,311,490,009. Since these numbers are quite large, I should use the code interpreter to handle this calculation accurately. Let's write a simple piece of code to compute this product.
Generated Python code:
# Calculating the product of the two numbers
number1 = 63157997633
number2 = 6311490009
product = number1 * number2
product
Result:
398621071049125148697
Thinking
Great, the code interpreter has given us the product of the two numbers. Now, let's present the result to the user.
====
Falsifying my claim is simple: Find a model that verifiably does basic arithmetic correctly without relying on any external tools.
I run Qwen3-32B locally without any tools (just llama.cpp), and it can do basic arithmetic on smaller numbers (like 134566), but I haven't checked it on much larger ones. I'm not at the PC right now, but trying it via OpenRouter on much larger numbers overflows the context and it stops without giving a result :)
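If anyone wants to run the same check themselves, below is a rough sketch of what I mean, using OpenRouter's OpenAI-compatible chat endpoint and comparing the reply against Python's exact big-integer product. The model slug ("qwen/qwen3-32b"), the OPENROUTER_API_KEY environment variable, and the prompt wording are just my assumptions; swap in whichever model you want to test.

    # Rough sketch: ask a model to multiply two large numbers with no tools,
    # then compare its answer against Python's exact big-integer arithmetic.
    # Assumes an OpenRouter key in OPENROUTER_API_KEY; the model slug below
    # is a guess -- substitute the model you actually want to test.
    import os
    import re
    import requests

    a, b = 63157997633, 6311490009
    expected = a * b  # Python ints are arbitrary precision, so this is exact

    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "qwen/qwen3-32b",
            "messages": [{
                "role": "user",
                "content": f"What is {a} * {b}? Do not use any tools. "
                           "Reply with only the final number.",
            }],
        },
        timeout=300,
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    digits = re.sub(r"[^\d]", "", answer)  # strip commas, spaces, etc.

    print("model said:", answer.strip())
    print("correct" if digits == str(expected) else f"wrong (expected {expected})")

It's not bulletproof (a reasoning model may echo extra numbers before the final answer), but it's enough to see whether the bare model gets the product right.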
I think the point of the line of questioning is to illustrate that "tools" like a code interpreter act as scratch space for models to do work in, because the reasoning/thinking process has limitations much like our own.