E9.2: Statistics on Lead in Drinking Water

Example: Statistics from a Data File

The comma-separated data file, Flint-Pb.csv, contains data collected by a citizen science campaign to measure the concentration of lead (in parts per billion, ppb) in the drinking water of Flint, Michigan in the US. We will determine the maximum, mean, median and standard deviation of these data. The legal action level for lead concentrations is 15 ppb: we will also count the number of samples that exceed this level.

The first few lines of the file are as follows:

# Data from FlintWaterStudy.org (2015) "Lead Results from Tap Water Sampling
# in Flint, MI during the Flint Water Crisis"
# Lead concentration in parts per billion ("Bottle 1, first draw").
SampleID,Pb / ppb
1,0.344
2,8.133
4,1.111
5,8.007
6,1.951
7,7.2
8,40.63
9,1.1
12,10.6
13,6.2
14,-
15,4.358
16,24.37
17,6.609
18,4.062
19,2.484
...

There are four header lines, and missing data (e.g. for sample ID 14) is represented by a hyphen. These data can be read by numpy.genfromtxt:

import numpy as np

sampleID, Pb_ppb = np.genfromtxt('Flint-Pb.csv', skip_header=4,
                                 unpack=True, delimiter=',')

We can look at the first 20 values, to see that the missing data appear as NaN values:

Pb_ppb[:20]
array([ 0.344,  8.133,  1.111,  8.007,  1.951,  7.2  , 40.63 ,  1.1  ,
       10.6  ,  6.2  ,    nan,  4.358, 24.37 ,  6.609,  4.062,  2.484,
        0.438,  1.29 ,  0.548,  3.131])

There are two ways to obtain the necessary statistics. Most straightforward is to use the NumPy methods that ignore NaN values:

maximum = np.nanmax(Pb_ppb)
mean = np.nanmean(Pb_ppb)
median = np.nanmedian(Pb_ppb)
std = np.nanstd(Pb_ppb)

print(f'[Pb] Maximum = {maximum:.1f} ppb')
print(f'[Pb] Mean = {mean:.1f} ppb')
print(f'[Pb] Median = {median:.1f} ppb')
print(f'[Pb] Standard deviation = {std:.1f} ppb')
[Pb] Maximum = 158.0 ppb
[Pb] Mean = 10.6 ppb
[Pb] Median = 3.5 ppb
[Pb] Standard deviation = 21.5 ppb

Alternatively, we can remove the NaN values from the Pb_ppb array and use the regular NumPy statistics methods:

Pb_ppb = Pb_ppb[~np.isnan(Pb_ppb)]
np.max(Pb_ppb), np.mean(Pb_ppb), np.median(Pb_ppb), np.std(Pb_ppb)
(158.0, 10.6459926199262, 3.521, 21.520960778768078)

Our code is clearest if we name the lead concentration "action level" of 15 ppb as a suitable variable:

ACTION_LEVEL_Pb_ppb = 15
n_high_Pb = np.sum(Pb_ppb > ACTION_LEVEL_Pb_ppb)
print(f'Number of samples exceeding the "action level": {n_high_Pb}'
      f' out of {len(Pb_ppb)}')
Number of samples exceeding the "action level": 45 out of 271