Benford’s Law

Take a look at a natural data set, e.g. the populations of all the counties in the United States.

Figure 1 – County Map

There are over 3,000 counties. If you look at all the 3000+ population counts, the first digit is 1 about 30% of the time, the first digit is 2 about 18% of the time, and 9 about 5% of the time.

Figure 2

The digit 1 shows up as a first digit about 6 times as often as 9. Amazing! This phenomenon is called Benford’s law.

Figure 3 – Benford’s Law
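The predicted frequencies come from a simple formula: the probability that the first digit is d equals log10(1 + 1/d). Here is a quick sketch in Python (the code is illustrative, not part of the original post):

```python
import math

def benford(d):
    """Benford's law: probability that the leading digit of a number is d."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(f"{d}: {benford(d):.1%}")
# digit 1 appears about 30.1% of the time, digit 9 only about 4.6%
```

The nine probabilities sum to exactly 1, since the product of the terms (1 + 1/d) telescopes to 10.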

This phenomenon shows up in many other data sets. In addition to population data, it shows up in geographical data (e.g. areas of rivers), baseball statistics, numbers in magazine articles, numbers in street addresses, house prices and stock market trading data. Benford’s law also has practical applications in business because it shows up in corporate financial statements.

Let’s take a look at the annual financial statements of Google from 2013 to 2016. Here is a summary of the first digits in these financial statements.

Figure 4 – First Digits in Google’s Financial Statements

Looks very similar to Benford’s law, right? Here is a side-by-side comparison with Benford’s law.

Figure 5 – Side by Side Comparison of Google and Benford

Overall, there is general agreement between the first digits in Google’s financial statements and the percentages predicted by Benford’s law. There are some noticeable differences, however. For example, the first digit 1 shows up about 35% of the time in Google’s statements versus 30% of the time according to Benford’s law. The differences are not statistically significant in this case, a conclusion based on a statistical test called the chi-squared test.
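To make the chi-squared test concrete, here is a minimal sketch in Python. The observed counts below are hypothetical, not Google’s actual figures; the test compares them against the counts Benford’s law would predict.

```python
import math

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# hypothetical first-digit counts (digits 1 through 9) from a set of statements
observed = [122, 60, 44, 31, 26, 23, 18, 16, 12]
total = sum(observed)
expected = [p * total for p in benford]

# chi-squared goodness-of-fit statistic with 9 - 1 = 8 degrees of freedom
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 15.507 is the critical value at the 5% significance level for 8 degrees of freedom
print(chi2 < 15.507)  # True means no significant departure from Benford's law
```

When the statistic exceeds the critical value, the discrepancy is statistically significant and warrants a closer look.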

The comparison between the first digits in Google’s financial statements and Benford’s law suggests that Benford’s law can be a powerful tool for detecting financial fraud.

How do you use Benford’s law? Compare the actual frequencies of the first digits in a set of financial statements with the frequencies predicted by Benford’s law. Anyone who fakes financial data at random is unlikely to produce data that look convincing. Even when the first digits are not uniformly distributed, too big a discrepancy from Benford’s law (e.g. too few 1’s or too many 7’s, 8’s and 9’s) raises a giant red flag, at which point the investigator can apply more sophisticated tests for further evaluation.
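This screening step is easy to sketch in Python; the helper functions below are my own illustration of extracting first digits and tallying their frequencies.

```python
from collections import Counter

def first_digit(x):
    """Leading nonzero digit of a positive number."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def digit_frequencies(values):
    """Relative frequency of each first digit (1-9) in a list of numbers."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}
```

The resulting frequencies can then be compared digit by digit against Benford’s predicted percentages.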

Remember this guy?

Figure 6 – Bernie Madoff’s Mug Shot

Bernie Madoff perpetrated what was likely the most massive Ponzi scheme in history. His operation had a constant need to make up numbers in order to keep up the appearance of legitimate investing. One study showed that the first digits of the monthly returns over a 215-month period did not conform to Benford’s law. Bernie Madoff could have been caught much sooner if auditors and regulators had been willing to look more closely.

Whether the financial data in question are close to Benford’s law does not prove anything by itself. But too big a discrepancy should raise suspicion, and the investigator can then evaluate further using more sophisticated methods.

There are other applications in addition to fraud detection and forensic accounting. Benford’s law can be used to detect changes in natural processes (e.g. earthquake detection) and as a tool to assess the appropriateness of mathematical models (e.g. assessing whether projected population counts make sense).

This post is an abbreviated version of an article on Benford’s law in an affiliated blog.

\copyright 2017 – Dan Ma


Law of large numbers

Gambling games are a great way to illustrate the law of large numbers, a fundamental principle in probability. In a gambling context, it states that individual bets are unpredictable, but in the long run (after hundreds or thousands or more bets) the results of the bets are stable and predictable. Predictable in this case means that the casino always wins. Furthermore, the average win of the casino approaches the house edge, which is the theoretical winning per bet. Of course, this principle describes many other random phenomena that have nothing to do with gambling. In this post we point out several illustrations of the law of large numbers.

From a game of chance perspective, one way to confirm the law of large numbers is through actual gambling (with real money). Walk into a casino several nights in a row and play, then see what happens at the end of each night. Of course, this could be a very costly way to obtain data. Another way is through simulation, which can be done by hand (rolling dice and flipping coins) or performed on a computer. Game results can be generated using computer-generated random numbers based on the presumed odds of the game in question. In a computer simulation, thousands or tens of thousands of bets can be simulated, and we can then see what the long-run results look like.

Take the game of roulette as an example. In American roulette (a wheel with both 0 and 00), the payout rules are designed in such a way that the house edge of nearly every bet that can be made is 5.26%. This is the theoretical average winning of the house (or the theoretical loss of the gambler) per unit bet. So for each $1 bet (say the bet on the color red), the house is expected to win $0.0526 (5.26 cents). Per $100 bet, the house is expected to win $5.26. If the total dollar value of the bets made in a given night is $1 million, then the house expects to collect $52,600. As long as customers are flowing into the casino, the casino wins. The mathematical aspect is discussed in this post in a companion blog. Simulations are demonstrated in this post.
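The 5.26% figure can be verified in a couple of lines, e.g. for the even-money bet on red:

```python
# American roulette: 18 red, 18 black, 2 green pockets (0 and 00) -> 38 total
# A $1 bet on red pays 1:1
p_win = 18 / 38
p_lose = 20 / 38
house_edge = -(p_win * 1 + p_lose * (-1))  # gambler's expected loss per $1 bet
print(f"{house_edge:.4f}")  # prints 0.0526, i.e. 5.26%
```

The same calculation, with different payout odds, gives the identical 5.26% edge for the other standard roulette bets.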

Another simulation is performed on the carnival game of Chuck-a-Luck (see here). Some people mistakenly think that the bets in Chuck-a-Luck are made on an even-odds basis. The house edge is actually close to 8% (worse odds, from the gambler’s perspective, than the roulette wheel).
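Assuming the standard Chuck-a-Luck payout (the player bets $1 on a number, three dice are rolled, and the player wins $1 per die showing that number, losing the $1 if no die does), the roughly 8% edge can be derived exactly by enumerating all 216 dice outcomes:

```python
from itertools import product

def chuck_a_luck_edge(number=1):
    """House edge per $1 bet, enumerating all 6^3 = 216 equally likely rolls."""
    total_payoff = 0
    for dice in product(range(1, 7), repeat=3):
        matches = dice.count(number)
        total_payoff += matches if matches > 0 else -1
    return -total_payoff / 216

print(f"{chuck_a_luck_edge():.4f}")  # prints 0.0787, i.e. close to 8%
```

By symmetry the edge is the same whichever number the player bets on.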

Here’s an example of a simulation in a context outside of gambling. In rolling a fair die, the value of the die ranges from 1 to 6 and is random and unpredictable. Yet the long-run results are stable and predictable: the simulated values of the die average out to around 3.5 in the long run. See this post in a companion blog for details. The blog post describes a random experiment called the n-dice gas station, a gas station where you roll dice to determine the price of gas. For example, in the 1-die gas station, the price is determined by rolling a fair die (whatever value comes up is the price of gas per gallon). In a 2-dice gas station, you roll two dice and the price of gas per gallon is the average of the two dice. The blog post illustrates the sampling distribution of the sample mean as well as the law of large numbers.
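A bare-bones version of this die-rolling simulation in Python:

```python
import random

random.seed(2017)  # fixed seed so the run is reproducible

# simulate 100,000 rolls of a fair die
rolls = [random.randint(1, 6) for _ in range(100_000)]
average = sum(rolls) / len(rolls)
print(round(average, 3))  # close to the theoretical mean of 3.5
```

With only a handful of rolls the average bounces around, but by 100,000 rolls it settles very close to 3.5.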

Here’s a simulation of the law of large numbers in the context of the coupon collector problem (see here).
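As a sketch of that idea, here is a minimal coupon collector simulation in Python with 6 coupon types (e.g. the faces of a die), where the theoretical mean number of draws is 6 × (1 + 1/2 + … + 1/6) = 14.7:

```python
import random

random.seed(7)  # fixed seed for reproducibility

def draws_to_collect(n=6):
    """Number of draws until all n coupon types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randint(1, n))
        draws += 1
    return draws

trials = [draws_to_collect() for _ in range(10_000)]
average = sum(trials) / len(trials)
print(round(average, 1))  # close to the theoretical mean of 14.7
```

Individual trials vary widely (a full collection can take 6 draws or 40), yet the average over many trials is stable, exactly as the law of large numbers predicts.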

The possibilities for simulating gambling games and other random phenomena are endless. Anyone who thinks he can “beat the odds” over a long series of bets should try simulating bets by rolling dice and/or tossing coins. A clear advantage for the house should emerge fairly quickly.

\copyright 2017 – Dan Ma

The gambler’s ruin

Here are two formulas that tell a great story about what would happen to a gambler if he or she plays long enough in a casino.

Formula 1

\displaystyle A_i=\frac{i}{n}

Formula 2

\displaystyle A_i=\frac{\displaystyle 1-\biggl(\frac{q}{p} \biggr)^i}{\displaystyle 1- \biggl(\frac{q}{p} \biggr)^n}

Both formulas are answers to the following problem called the gambler’s ruin.

The Gambler’s Ruin

Two gamblers, A and B, are betting on the tosses of a coin such that the probability of getting a head is p. Let q=1-p, which would be the probability of getting a tail in a coin toss. At the beginning of the game, player A has i units of wealth and player B has n-i units (together both players have a combined wealth of n units at the beginning). In each play of the game, the coin is tossed. If the result of the coin toss is head, player A collects 1 unit from player B. If the result of the coin toss is tail, player A pays player B 1 unit. The game continues until one of the players has all the n units of wealth. What is the probability that player A ends up with all n units? What is the probability that player B ends up with all n units?

This problem is very interesting if there is a great disparity in wealth between the two players. If the wealth of one of the players is a tiny percentage of the other player’s, then the problem describes the game between a gambler and a casino.

Imagine that player A is a gambler and player B is the casino. In this scenario, the number n is large and i is very small relative to n. For example, the casino has $999,000 on hand and the gambler has only $1,000, so the combined wealth is $1,000,000. One unit is $1. If the game is played in the manner described above, what is the probability that player A ends up with $1,000,000? This would be the probability of player A breaking the bank: invest $1,000 and end up with $1 million. How likely is that? What is the probability that player B (the casino) ends up with $1,000,000? This would be the probability of player A losing the $1,000 he starts with (the probability of ruin for player A).

Let’s explain what A_i is in the above two formulas. If player A starts with i units of wealth and player B starts with n-i units, let A_i be the probability that player A will end up with all n units of wealth and let B_i be the probability that player B will end up with all n units of wealth. Of course, this is assuming that the game continues until one of the players is broke.

Formula 1 is the case that the coin used in the game is a fair coin so that p = 0.5. Formula 2 is the case that the coin is not a fair coin so that p is not 0.5.

Also, A_i+B_i is always 1.0. Once A_i is computed using one of the above formulas, B_i=1-A_i.

In the event that the gambler plays a fair game in a casino, the probability that he or she will win everything is the ratio of the gambler’s wealth to the total combined wealth of the gambler and the casino. In the case that player A only has $1,000 and the combined bankroll is $1,000,000, A_i = 1,000 / 1,000,000 = 0.001 (0.1%). The probability of ruin for player A is then 99.9%. So there is a 99.9% chance that player A will lose the $1,000. But this is only if they play a fair game in the casino, which they don’t.

Let’s say the game is biased in favor of the house: the probability of getting a head is p = 0.49, so the gambler expects to lose p - q = -0.02 units per bet. If the exponent n is too big, a calculator would give an overflow error, so let’s assume that the gambler has 5 units (i = 5) and the total bankroll is n = 1,000. Then we use Formula 2 to find A_{5}.

    \displaystyle A_{5}=\frac{\displaystyle 1-\biggl(\frac{0.51}{0.49} \biggr)^5}{\displaystyle 1- \biggl(\frac{0.51}{0.49} \biggr)^{1000}}=9.35731 \times 10^{-19}

This is virtually a zero probability, which means B_5 is virtually 1. So the ruin of player A is all but certain. We all know that the casino always has the upper hand. The formulas tell a great story.
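The two formulas translate directly into a short function; here is a sketch in Python that reproduces both examples above (the fair $1,000-versus-$999,000 game and the biased game with i = 5, n = 1,000, p = 0.49):

```python
def ruin_win_prob(i, n, p):
    """Probability that player A, starting with i of n total units,
    ends up with all n units when each toss gives A a unit with probability p."""
    if p == 0.5:
        return i / n                        # Formula 1: fair coin
    r = (1 - p) / p                         # the ratio q/p
    return (1 - r ** i) / (1 - r ** n)      # Formula 2: biased coin

print(ruin_win_prob(1000, 1_000_000, 0.5))  # 0.001
print(ruin_win_prob(5, 1000, 0.49))         # about 9.36e-19
```

For very large n and p < 0.5, the term (q/p)^n can overflow even in floating point; in that regime the formula is effectively 0 anyway, since A_i tends to (p/q)^... of negligible size as n grows.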

For more information about the gambler’s ruin, see the following two blog posts. This post explains how the two formulas are derived. This post explains how the formulas are used and offers further commentary.

\copyright 2017 – Dan Ma