Benford’s Law

Take a look at a natural data set, e.g. the populations of all the counties in the United States.

Figure 1 – County Map

There are over 3,000 counties. Across all of these population counts, the first digit is 1 about 30% of the time, 2 about 18% of the time, and 9 about 5% of the time.

Figure 2

The digit 1 shows up as a first digit about 6 times as often as 9. Amazing! This phenomenon is called Benford’s law.

Figure 3 – Benford’s Law
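The percentages behind Benford's law come from a simple formula: the predicted share of numbers with first digit d is log10(1 + 1/d). This post does not state the formula, so take the following as a minimal sketch of the standard statement of the law; the function name is made up for illustration.

```python
import math

def benford_probability(digit: int) -> float:
    """Predicted share of numbers whose first digit is `digit` (1 through 9)."""
    return math.log10(1 + 1 / digit)

# Print the predicted share for each possible first digit.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

The nine shares add up to 1, and the share for 1 is roughly six and a half times the share for 9, in line with the county population figures quoted above.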

This phenomenon shows up in many other data sets. In addition to population data, it shows up in geographical data (e.g. areas of rivers), baseball statistics, numbers in magazine articles, numbers in street addresses, house prices and stock market trading data. Benford’s law also has practical applications in business because it shows up in corporate financial statements.

Let’s take a look at the annual financial statements of Google from 2013 to 2016. This is the summary of the first digits in these financial statements.

Figure 4 – First Digits in Google’s Financial Statements

Looks very similar to Benford’s law, right? Here is a side-by-side comparison with Benford’s law.

Figure 5 – Side by Side Comparison of Google and Benford

Overall, there is a general agreement between first digits in Google’s financial statements and the predicted percentages according to Benford’s law. There are some noticeable differences. For example, the first digit 1 shows up about 35% of the time in Google’s statements versus 30% of the time according to Benford’s law. The differences are not statistically significant in this case. This conclusion is based on a statistical test called the chi-squared test.
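For readers curious about the chi-squared test mentioned here, the sketch below shows one way such a comparison could be set up. The observed counts are hypothetical, made up only so that 1 appears about 35% of the time; they are not Google's actual figures. The critical value 15.507 is the standard 5% cutoff for 8 degrees of freedom (nine digit categories minus one).

```python
import math

# Benford's predicted share for each first digit 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

# 5% critical value of the chi-squared distribution with 8 degrees of freedom.
CRITICAL_VALUE_5PCT = 15.507

def chi_squared_statistic(observed_counts):
    """Sum of (observed - expected)^2 / expected over the nine first digits."""
    total = sum(observed_counts)
    return sum(
        (obs - total * p) ** 2 / (total * p)
        for obs, p in zip(observed_counts, BENFORD)
    )

# Hypothetical counts of first digits 1..9 (NOT taken from any real statement).
counts = [70, 34, 24, 18, 15, 13, 10, 9, 7]
stat = chi_squared_statistic(counts)
print(f"chi-squared = {stat:.2f}")
if stat > CRITICAL_VALUE_5PCT:
    print("difference from Benford's law is significant at the 5% level")
else:
    print("difference from Benford's law is not significant at the 5% level")
```

A statistic below the cutoff means the gaps between actual and predicted frequencies are within what chance alone would plausibly produce for a sample of that size.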

The close agreement between the first digits in Google’s financial statements and Benford’s law suggests why Benford’s law can be a powerful tool for detecting financial fraud: legitimate financial data tend to follow it.

How do you use Benford’s law? Compare the actual frequencies of the first digits in a set of financial statements with the frequencies predicted by Benford’s law. Anyone who fakes financial data at random is unlikely to produce first digits that look convincing. Even when the faked first digits are not uniformly distributed, too large a discrepancy between the actual first digits and Benford’s law (e.g. too few 1’s or too many 7’s, 8’s and 9’s) will raise a giant red flag, at which point the investigator can use more sophisticated tests for further evaluation.
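The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are invented here, and the list of amounts is made up and far too short for a real analysis.

```python
import math
from collections import Counter

def first_digit(value):
    """Return the leading nonzero digit of a number, or None for zero."""
    # Strip the sign, then any leading zeros and the decimal point.
    digits = str(abs(value)).lstrip("0.")
    return int(digits[0]) if digits else None

def first_digit_shares(values):
    """Share of values starting with each digit 1..9 (zeros are skipped)."""
    counts = Counter(d for d in map(first_digit, values) if d is not None)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no nonzero values to analyze")
    return {d: counts[d] / total for d in range(1, 10)}

# Hypothetical line items, made up purely for illustration.
amounts = [1250.0, 18400, 0.0931, 2210, 1.75, 310, 97, 1430, 26.5, 112000]
shares = first_digit_shares(amounts)
for d in range(1, 10):
    predicted = math.log10(1 + 1 / d)  # Benford's predicted share
    print(d, f"actual {shares[d]:.0%}", f"Benford {predicted:.0%}")
```

With real data, a large gap between the two columns is what would prompt a closer look.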

Remember this guy?

Figure 6 – Bernie Madoff’s Mug Shot

Bernie Madoff perpetrated what is most likely the largest Ponzi scheme in history. His operation would have had a constant need to make up numbers to keep up the appearance of legitimate investing. One study showed that the first digits in the monthly returns over a 215-month period did not conform to Benford’s law. So Bernie Madoff might have been caught a lot sooner if auditors and regulators had been willing to look more closely.

Whether or not the financial data in question are close to Benford’s law, the comparison by itself does not prove anything. But too large a discrepancy should raise suspicion, and the investigator can then test or evaluate further using more sophisticated methods.

There are other applications in addition to fraud detection and forensic accounting. Benford’s law can be used to detect changes in natural processes (e.g. earthquake detection) and as a tool to assess the appropriateness of mathematical models (e.g. assessing whether projected population counts make sense).

This post is an abbreviated version of an article on Benford’s law in an affiliated blog.

© 2017 – Dan Ma


Law of large numbers

Gambling games are a great way to illustrate the law of large numbers, a fundamental principle in probability. In a gambling context, it states that individual bets can be unpredictable, but in the long run (after hundreds, thousands or more bets) the results are stable and predictable. Predictable in this case means that the casino always wins. Furthermore, the average win of the casino approaches the house edge, which is the theoretical winning per bet. Of course, this principle describes many other random phenomena that have nothing to do with gambling. In this post we point out several illustrations of the law of large numbers.

From a game of chance perspective, one way to confirm the law of large numbers is through actual gambling (with real money). Walk into a casino several nights in a row and play. Then see what happens at the end of each night. Of course, this could be a very costly way to obtain data. Another way is through simulation, which can be done by hand (rolling dice and flipping coins) or on a computer. Game results can be generated using computer-generated random numbers based on the presumed odds of the game in question. In a computer simulation, thousands or tens of thousands of bets can be simulated. We can then see what the long run results look like.

Take the game of roulette as an example. The payout rules of the game are designed in such a way that the house edge of nearly every bet that can be made is 5.26% (on the American double-zero wheel). This is the theoretical average winning of the house (or the theoretical loss of the gambler) per unit bet. So for each $1 bet (say the bet on the color red), the house is expected to win $0.0526 (5.26 cents). Per $100 bet, the house is expected to win $5.26. If the total dollar value of the bets made in a given night is $1 million, then the house expects to collect $52,600. As long as customers are flowing into the casino, the casino wins. The mathematical aspect is discussed in this post in a companion blog. Simulations are demonstrated in this post.
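The 5.26% figure for the bet on red can be checked directly, and a short simulation shows the house's average win per bet settling near it. The sketch below assumes an American wheel with 38 pockets (18 red, 18 black, 0 and 00) and an even-money payout on red; the function names and the pocket numbering are invented for illustration.

```python
import random

def red_bet_edge():
    """Theoretical house win per $1 bet on red on a 38-pocket wheel."""
    # 20 pockets (18 black, 0, 00) win $1 for the house; 18 red pockets lose $1.
    return (20 - 18) / 38

def simulate_red_bets(num_bets, seed=None):
    """Average house win per $1 bet on red over `num_bets` simulated spins."""
    rng = random.Random(seed)
    house_winnings = 0
    for _ in range(num_bets):
        pocket = rng.randrange(38)  # pockets 0..17 are treated as red
        house_winnings += -1 if pocket < 18 else 1
    return house_winnings / num_bets

print(f"theoretical edge: {red_bet_edge():.4f}")
for n in (100, 10_000, 1_000_000):
    print(n, f"{simulate_red_bets(n):.4f}")
```

With only 100 bets the average can land almost anywhere, including in the gambler's favor. As the number of bets grows, the average tends to settle close to the theoretical edge.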

Another simulation is performed on the carnival game of Chuck-a-Luck (see here). Some people mistakenly think that the bets in Chuck-a-Luck are made on an even-odds basis. The house edge is actually close to 8% (worse odds than the roulette wheel from the gambler’s perspective).
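The figure of close to 8% can be reproduced by listing all 216 outcomes of three dice. The sketch below assumes the common payout rule, which this post does not spell out: the gambler picks a number, wins $1 for each die showing that number, and loses the $1 stake if no die shows it. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def chuck_a_luck_edge(chosen=1):
    """Exact house edge per $1 bet under the assumed 1/2/3-to-1 payouts."""
    total = Fraction(0)  # gambler's total winnings over all 216 rolls
    for roll in product(range(1, 7), repeat=3):
        matches = roll.count(chosen)
        # Assumed payout: win $1 per matching die, lose $1 if none match.
        total += matches if matches else -1
    return -total / 216  # house edge = gambler's average loss per bet

edge = chuck_a_luck_edge()
print(edge, f"= {float(edge):.2%}")
```

Under these assumed payouts the edge works out to 17/216, or about 7.87% per bet.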

Here’s an example of a simulation in a context outside of gambling. In rolling a fair die, the value of the die can range from 1 to 6 and is random and unpredictable. Yet the long run results are actually stable and predictable. The simulated values of the die average out to around 3.5 in the long run. See this post in a companion blog for details. The blog post describes a random experiment called the n-dice gas station. This is a gas station where you roll dice to determine the price of gas. For example, in the 1-die gas station, the price is determined by rolling a fair die (whatever value comes up is the price of gas per gallon). In a 2-dice gas station, you roll two dice and the price of gas per gallon is the average of the two dice. The blog post illustrates the sampling distribution of the sample mean as well as the law of large numbers.
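The n-dice gas station is easy to simulate. The sketch below is not the code from the companion post; it is a minimal version with made-up function names, where each "day" the price is the average of n fair dice.

```python
import random

def gas_price(num_dice, rng):
    """Price per gallon for one day: the average of `num_dice` fair dice."""
    return sum(rng.randint(1, 6) for _ in range(num_dice)) / num_dice

def average_price(num_dice, num_days, seed=None):
    """Average daily price over `num_days` simulated days."""
    rng = random.Random(seed)
    return sum(gas_price(num_dice, rng) for _ in range(num_days)) / num_days

for days in (10, 1_000, 100_000):
    print(days, f"{average_price(1, days):.3f}", f"{average_price(2, days):.3f}")
```

Over a handful of days the average price can wander, but over many days it tends to settle near 3.5 for both the 1-die and the 2-dice station.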

Here’s a simulation of the law of large numbers in the context of the coupon collector problem (see here).
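As a rough illustration, the sketch below assumes the standard form of the coupon collector problem, which this post does not define: each draw yields one of n equally likely coupon types, and we count the draws needed to see every type at least once. The theoretical mean is n(1 + 1/2 + ... + 1/n). The function names are invented here.

```python
import random

def draws_to_collect_all(num_coupons, rng):
    """Number of random draws until every coupon type has appeared."""
    seen = set()
    draws = 0
    while len(seen) < num_coupons:
        seen.add(rng.randrange(num_coupons))  # each type equally likely
        draws += 1
    return draws

def expected_draws(num_coupons):
    """Theoretical mean number of draws: n * (1 + 1/2 + ... + 1/n)."""
    return num_coupons * sum(1 / k for k in range(1, num_coupons + 1))

def simulated_mean(num_coupons, trials, seed=None):
    """Average number of draws over `trials` simulated collections."""
    rng = random.Random(seed)
    return sum(draws_to_collect_all(num_coupons, rng) for _ in range(trials)) / trials

print(f"theory: {expected_draws(6):.2f}")
for trials in (10, 1_000, 100_000):
    print(trials, f"{simulated_mean(6, trials):.2f}")
```

With six coupon types (think of rolling a die until all six faces have appeared), the theoretical mean is 14.7 draws, and the simulated average tends to approach it as the number of trials grows.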

The possibilities for simulating gambling games and other random phenomena are endless. Anyone who thinks that he can “beat the odds” over a long series of bets should just try simulating the bets by rolling dice and/or tossing coins. A clear advantage for the house should emerge fairly quickly.

© 2017 – Dan Ma