Large language models struggle with arithmetic

AI
October 17, 2024

About three calendar months ago (about 16 dog years), I needed to do a Monte Carlo analysis. I thought it would be interesting to use AI, so I asked ChatGPT, “Can you do a Monte Carlo analysis?” Yes was the answer. I gave ChatGPT the parameters and dragged in the file. ChatGPT generated code to perform the analysis, then executed it. Very cool. Then out came a clearly wrong answer.

When I looked into the result, I saw that it had eliminated every item that was 100% probable, and it insisted that was correct (it’s not). I gave it more direction, but ChatGPT still refused to include the items with a 100% probability. So I said, “Do the Monte Carlo, then add in those items with 100% probability.” Next, I asked for a cumulative probability distribution. It gave me a straight line. I told it an S-curve was more appropriate, and it produced one. I ran the analysis a few times and saw the expected variance in the results.
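For readers who want to see the shape of the calculation, here is a minimal sketch of this kind of analysis. It is not the code ChatGPT generated and not my actual data: the item names, probabilities, cost ranges, and the choice of a triangular distribution are all assumptions made up for illustration. It shows two things discussed above: items with a 100% probability are included in every trial, and the cumulative distribution is built from the sorted trial totals, which typically gives an S-shaped curve rather than a straight line.

```python
import random

# Hypothetical items: (name, probability of occurring, low, most likely, high cost).
# All values are invented for illustration only.
ITEMS = [
    ("base build",   1.00, 90.0, 100.0, 130.0),  # certain: present in every trial
    ("integration",  1.00, 40.0,  50.0,  80.0),  # certain: present in every trial
    ("rework",       0.30, 10.0,  20.0,  60.0),
    ("vendor delay", 0.10,  5.0,  15.0,  40.0),
]


def simulate_totals(items, trials=10_000, seed=42):
    """Run the Monte Carlo trials and return one total cost per trial."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0.0
        for _name, prob, low, mode, high in items:
            # rng.random() returns a value in [0.0, 1.0), so an item with
            # prob == 1.0 always passes this check and is never dropped.
            if rng.random() < prob:
                total += rng.triangular(low, high, mode)
        totals.append(total)
    return totals


def empirical_cdf(totals):
    """Return (cost, cumulative probability) pairs from the sorted totals."""
    ordered = sorted(totals)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


def percentile(totals, p):
    """Return the cost at or below which roughly a fraction p of trials fall."""
    ordered = sorted(totals)
    index = min(len(ordered) - 1, int(p * len(ordered)))
    return ordered[index]


totals = simulate_totals(ITEMS)
cdf = empirical_cdf(totals)
print("P50 cost:", round(percentile(totals, 0.50), 1))
print("P80 cost:", round(percentile(totals, 0.80), 1))
```

A quick sanity check on a run like this is that no trial total falls below the sum of the low values of the certain items (here 90 + 40 = 130). If it does, the 100% items have been dropped. Changing the seed gives slightly different percentiles, which is the run-to-run variance you would expect.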

My point: large language models struggle with arithmetic. I hear some people say they can replace hundreds of effort-years of data collection and analysis with “AI.” Most don’t consider where the “data” is coming from, or that modeling is far more than simple regression.

Apple just released an excellent paper, “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” The bottom line is that the GSM8K benchmark assesses large language models’ mathematical abilities on grade-school math word problems. The authors point out that while LLMs seem to be improving on this test, the models themselves do not appear to have improved their mathematical reasoning. They concluded the test was flawed (hmmm, handling specific cases so the test results look better sounds like processor benchmark tricks) and came up with a better test of mathematical skills.

Their conclusion:

“Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer.”

Of course, there are tools, such as Google’s TensorFlow library or Wolfram Alpha, that can perform mathematics. But that capability isn’t coming from today’s LLMs themselves.

Still, Galorath has found that LLMs can do an excellent job of turning natural-language descriptions into the parameters that drive cost, schedule, and risk models such as SEER.

This is very exciting because it provides viable inputs to proven estimation equations. Stay tuned for additional conclusions on the topic.
