Benford's Law

Question P2.4.7

Benford's Law is an observation about the distribution of the frequencies of the first digits of the numbers in many different data sets. It is frequently found that the first digits are not uniformly distributed, but follow the logarithmic distribution $$ P(d) = \log_{10}\left( \frac{d+1}{d} \right). $$ That is, numbers starting with 1 are more common than those starting with 2, and so on, with those starting with 9 the least common. The probabilities are given below:

Digit	Probability
1	0.301
2	0.176
3	0.125
4	0.097
5	0.079
6	0.067
7	0.058
8	0.051
9	0.046

Benford's Law is most accurate for data sets which span several orders of magnitude, and can be proved to be exact for some infinite sequences of numbers.

(a) Demonstrate that the first digits of the first 500 Fibonacci numbers (see this Example) follow Benford's Law quite closely.

(b) The length of the amino acid sequences of 500 randomly-chosen proteins are provided in the file protein_lengths.py. This file contains a list, naa, which can be imported at the start of your program with

from protein_lengths import naa

To what extent does the distribution of protein lengths obey Benford's Law?

Solution P2.4.7

(a) In the code below we calculate the first 500 Fibonacci numbers, determine the first digit of each and turn the total numbers of each digit encountered into a probability distribution for comparison with Benford's Law.

import math

# We will analyse the first n Fibonacci numbers
n = 500
# The first 2 numbers of the series are 1, 1
a = b = 1
# fd is a list such that fd[d] is the number of Fibonacci numbers up to n
# which start with the digit d. Note that none start with 0!
fd = [None,2,0,0,0,0,0,0,0,0]

# Loop over the 3rd, 4th, ..., nth Fibonacci numbers
for i in range(3,n+1):
    # This is the propagation step: generate the current Fibonacci number, a,
    # as the sum of the last two numbers (a+b), and store the old value of a
    # as b (the old value of b is forgotten).
    a, b = a+b, a
    # The first digit of a
    d = int(a/10**(int(math.log10(a))))
    fd[d] += 1

# The digits 1,2,3,4,5,6,7,8,9
digits = range(1,10)
benford = [None]
for d in digits:
    # Normalize fd by dividing by n so it represents a probability
    fd[d] /= n
    # Use Benford's Law to predict the frequency of first digit d
    benford.append(math.log10((d+1)/d))

# Output a table of the predicted and observed distribution of first digits
print('-'*27)
print('Digit  Predicted   Observed')
print('-'*27)
for d in digits:
    print('  {:1d}       {:5.3f}      {:5.3f}'.format(d, benford[d], fd[d]))
print('-'*27)

The output from this program shows close agreement with Benford's Law:

---------------------------
Digit  Predicted   Observed
---------------------------
  1       0.301      0.302
  2       0.176      0.176
  3       0.125      0.126
  4       0.097      0.094
  5       0.079      0.080
  6       0.067      0.066
  7       0.058      0.058
  8       0.051      0.054
  9       0.046      0.044
---------------------------

We adopt the mathematical approach to retrieving the first digit of a Fibonacci number: $$ d = \left\lfloor \frac{a}{10^{\lfloor{\log_{10}a}\rfloor}} \right\rfloor $$ where $\lfloor x \rfloor$ denotes the floor of $x$: the largest integer smaller than $x$, which is what Python's int casting function gives.

An alternative way to retrieve the first digit is slightly hacky and (for large n) rather slower: turn the Fibonacci number into a string, get the first character of the string and cast it back into an integer: d = int(str(a)[0]).

Yet another way involves repeated integer division by 10 until the number becomes a single digit:

d = a
while d >= 10:
    d //= 10

For large n, this method is bound to be slower than the presented solution which effectively calculates how many times to divide by 10 before doing so.

(b) We can analyse the provided protein sequence length data as follows

import math
from ex2-4_e_ii_protein_lengths import naa

# fd is a list such that fd[d] is the number of protein sequence lengths
# which start with the digit d. Note that none start with 0!
fd = [None,0,0,0,0,0,0,0,0,0]

for a in naa:
    d = int(a/10**(int(math.log10(a))))
    fd[d] += 1

# The digits 1,2,3,4,5,6,7,8,9
digits = range(1,10)
benford = [None]
n = len(naa)
for d in digits:
    # Normalize fd by dividing by n so it represents a probability
    fd[d] /= n
    # Use Benford's Law to predict the frequency of first digit d
    benford.append(math.log10((d+1)/d))

# Output a table of the predicted and observed distribution of first digits
print('-'*27)
print('Digit  Predicted   Observed')
print('-'*27)
for d in digits:
    print('  {:1d}       {:5.3f}      {:5.3f}'.format(d, benford[d], fd[d]))
print('-'*27)

The output shows qualitative but far from quantitative agreement with Benford's Law, perhaps because the length of these sequences does not span more than 3 orders of magnitude or so ($10^3 - 10^5$ amino acids).

---------------------------
Digit  Predicted   Observed
---------------------------
  1       0.301      0.250
  2       0.176      0.190
  3       0.125      0.200
  4       0.097      0.116
  5       0.079      0.090
  6       0.067      0.062
  7       0.058      0.036
  8       0.051      0.034
  9       0.046      0.022
---------------------------