
Large language models struggle with arithmetic

  • October 17, 2024

About 3 calendar months ago (about 16 dog years), I needed to do a Monte Carlo analysis. I thought it would be interesting to use AI. So I asked ChatGPT, “Can you do a Monte Carlo analysis?” Yes was the answer. So I gave ChatGPT the parameters and dragged in the file. ChatGPT generated code to perform the analysis and then executed it. Very cool. Then out came a clearly wrong answer.

When I looked into it, I saw that it had eliminated every item that was 100% probable, insisting that was correct (it’s not). I gave it more direction. ChatGPT still refused to include the items with a 100% probability. So I said, “Do the Monte Carlo, then add in those items with 100% probability.” Next, I asked for a cumulative probability distribution. It gave me a straight line. I told it an S-curve was more appropriate, and it produced one. I ran it a few times and saw the expected variance in the results.
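For readers who want to see the shape of such an analysis, here is a minimal sketch in Python. It is not the code ChatGPT generated, and the risk register, cost figures, and function names are all hypothetical. It assumes each item either occurs or does not in a given trial, and that the cost of an item that occurs follows a triangular distribution. The point it illustrates is that an item with 100% probability needs no special handling: it simply occurs in every trial.

```python
import random

# Hypothetical risk register (illustrative numbers only):
# (name, probability of occurring, low cost, most likely cost, high cost)
RISKS = [
    ("Baseline integration work", 1.00, 80.0, 100.0, 140.0),  # certain item
    ("Vendor delay", 0.30, 20.0, 40.0, 90.0),
    ("Rework after testing", 0.50, 10.0, 25.0, 60.0),
]


def simulate_totals(risks, trials=10_000, seed=42):
    """Run the Monte Carlo trials and return the total costs, sorted ascending."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0.0
        for _name, prob, low, mode, high in risks:
            # rng.random() is in [0.0, 1.0), so prob == 1.0 always passes
            # and the certain item is included in every trial.
            if rng.random() < prob:
                total += rng.triangular(low, high, mode)
        totals.append(total)
    return sorted(totals)


def percentile(sorted_totals, p):
    """Value at cumulative probability p, read off the empirical distribution."""
    index = min(len(sorted_totals) - 1, int(p * len(sorted_totals)))
    return sorted_totals[index]


totals = simulate_totals(RISKS)
for p in (0.10, 0.50, 0.80, 0.95):
    print(f"P{int(p * 100):02d}: {percentile(totals, p):8.1f}")
```

Plotting each sorted total against its rank divided by the number of trials gives the cumulative probability curve. For a sum of random costs like this one, that curve is typically S-shaped rather than a straight line. Changing the seed shows the run-to-run variance mentioned above.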

My point: large language models struggle with arithmetic. I hear some people say they can replace hundreds of effort-years of data collection and analysis with “AI.” Most don’t consider where the “data” is coming from, or that modeling is far more than simple regression.

Apple just released an excellent paper, “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” The bottom line is that the GSM8K benchmark is widely used to assess large language models’ mathematical abilities on grade-school math problems. The authors point out that while LLMs seem to be improving on this test, the models themselves do not appear to have improved mathematical reasoning. They concluded the test was unreliable (hmmm, handling the cases that make the test scores look better; sounds like processor benchmark tricks) and came up with a better test of mathematical skills.

Their conclusion:

“Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer.”

Of course, there are tools that can perform mathematics, such as Google’s TensorFlow library or Wolfram Alpha. But that capability isn’t coming from today’s LLMs themselves.

Still, Galorath has found that LLMs can do an excellent job of translating language into the parameters that drive cost, schedule, and risk models such as SEER.

This is very exciting: the LLM provides viable inputs to proven estimation equations. Stay tuned for additional conclusions on the topic.

Dan Galorath is a software developer, businessman, author, and founder and CEO of Galorath.
