The Law of Small Numbers: Developing Accurate Estimates with Limited Data

Parametric methods have traditionally relied on classical frequentist statistics, an approach that uses a sample of data as its input. If you took Statistics 101 in college, most if not all of the class was oriented toward this approach. Traditional linear and nonlinear regression analysis, for example, is frequentist. The challenge with frequentist statistics is that it requires a large amount of data. Statisticians who have conducted simulation studies using random data have concluded that a regression analysis needs 50 data points, plus 10 additional data points for every independent variable.
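As a quick illustration of that rule of thumb, the short Python sketch below computes the implied minimum sample size. The function name and its defaults are hypothetical, and it assumes the rule reads as 50 points plus 10 per independent variable.

```python
def min_sample_size(n_predictors: int, base: int = 50, per_predictor: int = 10) -> int:
    """Rule-of-thumb minimum sample size: base + per_predictor * n_predictors.

    Assumes the rule quoted in the text means 50 points plus 10 for each
    independent variable; the helper itself is illustrative only.
    """
    if n_predictors < 0:
        raise ValueError("n_predictors must be non-negative")
    return base + per_predictor * n_predictors


# Minimum sample sizes for regressions with one, two, and three predictors.
for k in (1, 2, 3):
    print(k, min_sample_size(k))
```

Even a one-variable regression calls for 60 observations under this reading, far more than most of the data sets discussed here contain.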

The highly specialized nature of the systems used by the Department of Defense and NASA means that there is typically nowhere near that much data available. For example, the Missile Defense Agency has developed only a handful of different kill vehicles, and NASA has developed only a few crewed launch vehicles. When looking at truly applicable data, the sample size shrinks even further. For launch vehicles, the primary systems that NASA has completed are those for the Apollo and Shuttle programs. The Apollo program began in the 1960s, and the Shuttle program began in the 1970s. Thus, there are no directly applicable historical data points within the last 40 years. Considering the changes in technology since then, there are no applicable historical data at all for these systems.

The Law of Small Numbers is the belief that large-sample methods and rules (like the Law of Large Numbers) apply to small data sets. This common belief is problematic when traditional statistics is applied to developing cost estimating relationships from small data sets, because it leads to inaccurate estimates that are based more on noise than on signal.
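The effect of noise on small samples can be seen in a simple simulation. The Python sketch below is an illustration under assumed settings, not an analysis from the paper: it repeatedly fits an ordinary least squares slope to synthetic data with a known true slope and compares how widely the fitted slope scatters for a sample of 5 points versus a sample of 60. The true slope, noise level, and sample sizes are all hypothetical choices.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Ordinary least squares slope for a one-variable regression."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_spread(n, trials=2000, true_slope=1.0, noise_sd=1.0, seed=0):
    """Standard deviation of fitted slopes across repeated samples of size n."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
        ys = [true_slope * x + rng.gauss(0.0, noise_sd) for x in xs]
        slopes.append(ols_slope(xs, ys))
    return statistics.stdev(slopes)


small = slope_spread(5)    # a handful of data points
large = slope_spread(60)   # the rule-of-thumb size for one predictor
print(f"slope spread with n=5:  {small:.2f}")
print(f"slope spread with n=60: {large:.2f}")
```

The fitted slope from five points scatters far more widely around the true value than the slope from sixty, which is the sense in which a small-sample estimate reflects noise more than signal.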

For small data sets like these, Bayesian methods can help improve accuracy. Bayesian methods leverage all your experience, making them less subject to being overwhelmed by noise. This prior experience can be subjective or objective.
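One way to picture how prior experience and sample data combine is the textbook Gaussian update with known variance, in which the posterior mean is a precision-weighted average of the prior mean and the sample mean. The Python sketch below is a minimal illustration of that standard result; the function name and all numbers are hypothetical and are not taken from the paper.

```python
def normal_posterior(prior_mean, prior_var, sample_mean, sample_var, n):
    """Posterior mean and variance for a Gaussian mean with known variance.

    Precision (1 / variance) measures how much each source is trusted;
    the posterior precision is the sum of the prior and data precisions.
    """
    prior_precision = 1.0 / prior_var
    data_precision = n / sample_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var


# Hypothetical inputs: prior experience suggests a value near 0.8,
# while four noisy observations average 1.2.
post_mean, post_var = normal_posterior(
    prior_mean=0.8, prior_var=0.04,
    sample_mean=1.2, sample_var=0.36, n=4,
)
print(post_mean, post_var)
```

With only four observations the prior carries most of the weight, so the posterior mean stays closer to the prior than to the sample mean, and the posterior variance is smaller than either source's variance alone.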

Bayesian methods have proven successful in a multitude of applications. Bayesian techniques were used in World War II to help crack the Enigma code used by the Germans, thus helping to shorten the war. John Nash’s equilibrium for games with incomplete or imperfect information is a form of Bayesian analysis (John Nash’s life was portrayed in the film A Beautiful Mind). Actuaries have used Bayesian methods for over 100 years to set insurance premiums. Researchers trained in Bayesian voice recognition went on to lead the portfolio and technical trading team for the Medallion Fund, a $5 billion hedge fund that has averaged annual returns of 35% after fees since 1989.

This paper introduces the Bayesian method and shows in detail how it can be applied to regression. The basic Gaussian framework is provided as a starting point and is explained in a straightforward and intuitive way. With limited sample data, two key assumptions in the standard Gaussian linear model are dubious. One is that the variance of the estimating equation is known and equal to the variance computed from the sample data. The second is that the residuals of the estimating equation derived from the sample data follow a Gaussian distribution. Neither assumption is valid for small samples. First, the assumption of known variance is relaxed, and an analytical method for conducting Bayesian analysis is derived. Next, the assumption of Gaussian residuals is relaxed, and the residuals are modeled with a Student’s t distribution instead, as is typically done for small samples. With both assumptions changed, no analytical solution is possible. Markov Chain Monte Carlo simulation is provided as a technique to overcome this.
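To give a sense of what such a simulation involves, the Python sketch below runs a random-walk Metropolis sampler, one simple form of Markov Chain Monte Carlo, for a one-variable linear regression with Student’s t residuals and an unknown scale. It is an illustrative sketch only, not the paper's implementation: the five data points, the priors, the fixed 4 degrees of freedom, and the step size are all hypothetical assumptions.

```python
import math
import random


def log_t_density(resid, scale, df):
    """Log density of a scaled Student's t distribution at resid."""
    z = resid / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))


def log_normal_density(x, mean, sd):
    """Log density of a Gaussian distribution at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def log_posterior(params, xs, ys, df=4.0):
    """Unnormalized log posterior for y = a + b * (x - mean(x)) + t-noise."""
    a, b, log_s = params
    scale = math.exp(log_s)
    x_bar = sum(xs) / len(xs)                   # centering reduces a-b correlation
    lp = log_normal_density(a, 0.0, 10.0)       # vague prior on the intercept
    lp += log_normal_density(b, 0.8, 0.2)       # informative slope prior (hypothetical)
    lp += log_normal_density(log_s, 0.0, 1.0)   # weak prior on the log residual scale
    lp += sum(log_t_density(y - (a + b * (x - x_bar)), scale, df)
              for x, y in zip(xs, ys))
    return lp


def metropolis(xs, ys, n_iter=20000, burn_in=5000, step=0.15, seed=1):
    """Random-walk Metropolis sampler; returns post-burn-in (a, b, log_s) draws."""
    rng = random.Random(seed)
    current = [0.0, 0.8, 0.0]
    current_lp = log_posterior(current, xs, ys)
    draws = []
    for i in range(n_iter):
        proposal = [p + rng.gauss(0.0, step) for p in current]
        proposal_lp = log_posterior(proposal, xs, ys)
        # Accept with probability min(1, posterior ratio); 1 - random() is in (0, 1].
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            draws.append(tuple(current))
    return draws


# Five hypothetical data points (not from the paper).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.7, 3.4, 3.9, 5.6]
draws = metropolis(xs, ys)
slope_mean = sum(d[1] for d in draws) / len(draws)
print(f"posterior mean slope: {slope_mean:.2f}")
```

The sampler needs only the ability to evaluate the posterior up to a constant, which is why it still works when the combination of a t likelihood and an unknown scale has no closed-form solution. A production analysis would also check convergence and tune the step size, which this sketch omits.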

A single practical example is carried throughout the paper. R code that implements the Markov Chain Monte Carlo simulation for the example is provided.
