Using NumPy's loadtxt function

The use of np.loadtxt is best illustrated using an example. Consider the following text file of data relating to a (fictional) population of students. This file can be downloaded as eg6-a-student-data.txt.

# Student data collected on 17 July 2014.
# Researcher: Dr Wicks, University College Newbury.

# The following data relate to N = 20 students. It
# has been totally made up and so therefore is 100%
# anonymous.

Subject Sex    DOB      Height  Weight       BP     VO2max
(ID)    M/F  dd/mm/yy     m       kg        mmHg  mL.kg-1.min-1
JW-1     M    19/12/95    1.82     92.4    119/76   39.3
JW-2     M    11/1/96     1.77     80.9    114/73   35.5
JW-3     F    2/10/95     1.68     69.7    124/79   29.1
JW-6     M    6/7/95      1.72     75.5    110/60   45.5
# JW-7    F    28/3/96     1.66     72.4    101/68   -
JW-9     F    11/12/95    1.78     82.1    115/75   32.3
JW-10    F    7/4/96      1.60     -       -/-      30.1
JW-11    M    22/8/95     1.72     77.2    97/63    48.8
JW-12    M    23/5/96     1.83     88.9    105/70   37.7
JW-14    F    12/1/96     1.56     56.3    108/72   26.0
JW-15    F    1/6/96      1.64     65.0    99/67    35.7
JW-16    M    10/9/95     1.63     73.0    131/84   29.9
JW-17    M    17/2/96     1.67     89.8    101/76   40.2
JW-18    M    31/7/96     1.66     75.1    -/-      -
JW-19    F    30/10/95    1.59     67.3    103/69   33.5
JW-22    F    9/3/96      1.70     -       119/80   30.9
JW-23    M    15/5/95     1.97     89.2    124/82   -
JW-24    F    1/12/95     1.66     63.8    100/78   -
JW-25    F    25/10/95    1.63     64.4    -/-      28.0
JW-26    M    17/4/96     1.69     -       121/82   39.

Let's find the average heights of the male and female students. The columns we need are the second and fourth (indexes 1 and 3), and there are no missing data in these columns, so we can use np.loadtxt. First construct a record dtype for the two fields, then read the relevant columns after skipping the first 9 header lines:

In [x]: fname = 'eg6-a-student-data.txt'
In [x]: dtype1 = np.dtype([('gender', '|S1'), ('height', 'f8')])
In [x]: a = np.loadtxt(fname, dtype=dtype1, skiprows=9, usecols=(1,3))
In [x]: a
Out[x]:
array([(b'M', 1.82), (b'M', 1.77), (b'F', 1.68), (b'M', 1.72),
       ...
       (b'M', 1.69)],
      dtype=[('gender', 'S1'), ('height', '<f8')])

To find the average height of the male students, we need to select only the records whose gender field is M. We can do this by creating a boolean array:

In [x]: m = a['gender'] == b'M'
In [x]: m
Out[x]: array([ True,  True, False,  True, ...,  True], dtype=bool)

m has one entry for each of the 19 valid records (one is commented out): True if the student is male and False if the student is female. The heights of the male students are therefore:

In [x]: print(a['height'][m])
[1.82 1.77 1.72 1.72 1.83 1.63 1.67 1.66 1.97 1.69]

The averages we need are therefore:

In [x]: m_av = a['height'][m].mean()
In [x]: f_av = a['height'][~m].mean()
In [x]: print('Male average: {:.2f} m, Female average: {:.2f} m'.format(m_av,f_av))
Male average: 1.75 m, Female average: 1.65 m

Note that ~m ("not m") is the element-wise logical negation of the boolean array m: it is True wherever m is False, and vice versa.
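The masking steps above can be tried without downloading the data file. The following is a minimal sketch that assumes a small, made-up sample held in an io.StringIO object (the subject IDs and heights are illustrative and are not taken from eg6-a-student-data.txt); np.loadtxt reads it just as it would an open file.

```python
import io
import numpy as np

# Hypothetical miniature data set in the same spirit as the student file:
# one header line, one commented-out record, whitespace-separated columns.
sample = io.StringIO(
    "Subject Sex Height\n"
    "A-1 M 1.80\n"
    "A-2 F 1.60\n"
    "# A-3 F 1.55\n"
    "A-4 M 1.70\n"
    "A-5 F 1.70\n"
)

dtype = np.dtype([('gender', 'S1'), ('height', 'f8')])
a = np.loadtxt(sample, dtype=dtype, skiprows=1, usecols=(1, 2))

# Boolean mask: True for male records, False for female records.
m = a['gender'] == b'M'

m_av = a['height'][m].mean()     # mean over records where m is True
f_av = a['height'][~m].mean()    # mean over records where m is False
print('Male average: {:.2f} m, Female average: {:.2f} m'.format(m_av, f_av))
```

The commented-out record is skipped automatically, so the mask has four entries rather than five.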

To perform the same analysis on the student weights we have a bit more work to do, because some values are missing (denoted by '-'). We could use np.genfromtxt (see Section 6.2.3 of the book), but let's write a converter function instead. We'll replace the missing values with the conveniently unphysical value of -99. The function parse_weight expects a string argument and returns a float:

def parse_weight(s):
    try:
        return float(s)
    except ValueError:
        return -99.

This is the function we want to pass as a converter for column 4:

In [x]: dtype2 = np.dtype([('gender', '|S1'), ('weight', 'f8')])
In [x]: b = np.loadtxt(fname, dtype=dtype2, skiprows=9, usecols=(1,4),
                       converters={4: parse_weight})

Now mask off the invalid data and index the array with a boolean array as before:

In [x]: mv = b['weight'] > 0    # elements only True for valid data
In [x]: m_wav = b['weight'][mv & m].mean()      # valid and male
In [x]: f_wav = b['weight'][mv & ~m].mean()     # valid and female
In [x]: print('Male average: {:.2f} kg, Female average: {:.2f} kg'.format(m_wav,f_wav))
Male average: 82.44 kg, Female average: 66.94 kg
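As an alternative to the -99 sentinel, a converter can return NaN for missing values, and the averages can then be taken with np.nanmean, which ignores NaN entries. This is a sketch of that variation, not the approach used above: the function name parse_weight_nan and the inline sample data are assumptions made for illustration.

```python
import io
import numpy as np

# Hypothetical sample: subject ID, sex, weight in kg, with '-' for missing.
sample = io.StringIO(
    "A-1 M 80.0\n"
    "A-2 F 60.0\n"
    "A-3 M -\n"
    "A-4 F 70.0\n"
    "A-5 M 90.0\n"
)

def parse_weight_nan(s):
    """Return s as a float, or NaN if it cannot be parsed (e.g. '-')."""
    try:
        return float(s)
    except ValueError:
        return np.nan

dtype = np.dtype([('gender', 'S1'), ('weight', 'f8')])
w = np.loadtxt(sample, dtype=dtype, usecols=(1, 2),
               converters={2: parse_weight_nan})

male = w['gender'] == b'M'
# np.nanmean skips the NaN entries, so no separate validity mask is needed.
m_wav = np.nanmean(w['weight'][male])
f_wav = np.nanmean(w['weight'][~male])
print('Male average: {:.2f} kg, Female average: {:.2f} kg'.format(m_wav, f_wav))
```

The trade-off is that NaN propagates through ordinary reductions such as mean(), so every later calculation has to use the NaN-aware functions or mask with np.isnan first.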

Finally, let's read in the blood pressure data. Here we have a problem, because the systolic and diastolic pressures are separated not by whitespace but by a forward slash (/). One solution is to reformat each line, replacing the slash with a space, before it is fed to np.loadtxt. Recall that fname can be a generator instead of a filename or open file: we write a suitable generator function, reformat_lines, which takes an open file object and yields its lines to np.loadtxt, one by one, after the replacement. This changes the column numbering, because the replacement also splits each birth date into three columns: in the reformatted lines the blood pressure values are in the columns indexed at 7 and 8.

import numpy as np

fname = 'eg6-a-student-data.txt'
dtype3 = np.dtype([('gender', '|S1'), ('bps', 'f8'), ('bpd', 'f8')])

def parse_bp(s):
    try:
        return float(s)
    except ValueError:
        return -99.

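def reformat_lines(fi):
    """Yield each line of the open file fi with '/' replaced by a space."""
    for line in fi:
        yield line.replace('/', ' ')

with open(fname) as fi:
    # After reformatting, gender is column 1 and the two pressures are
    # columns 7 and 8; unpack=True returns one array per field.
    gender, bps, bpd = np.loadtxt(reformat_lines(fi), dtype=dtype3,
                skiprows=9, usecols=(1, 7, 8),
                converters={7: parse_bp, 8: parse_bp}, unpack=True)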

# now do something with the data...
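One way to "do something with the data" is to average the valid pressures for the male students, reusing the masking idea from the weights. The following self-contained sketch assumes a three-line, made-up sample in place of the real file; the names valid and male are illustrative.

```python
import io
import numpy as np

# Hypothetical records: ID, sex, DOB, height, weight, BP, VO2max.
sample = io.StringIO(
    "A-1 M 1/2/96 1.80 80.0 120/80 40.0\n"
    "A-2 F 3/4/96 1.60 -    -/-    30.0\n"
    "A-3 M 5/6/95 1.70 70.0 110/70 -\n"
)

dtype3 = np.dtype([('gender', 'S1'), ('bps', 'f8'), ('bpd', 'f8')])

def parse_bp(s):
    """Return s as a float, or -99. for a missing value such as '-'."""
    try:
        return float(s)
    except ValueError:
        return -99.

def reformat_lines(fi):
    """Yield each line of fi with '/' replaced by a space."""
    for line in fi:
        yield line.replace('/', ' ')

# The date now occupies columns 2-4, so the pressures are columns 7 and 8.
gender, bps, bpd = np.loadtxt(reformat_lines(sample), dtype=dtype3,
                              usecols=(1, 7, 8),
                              converters={7: parse_bp, 8: parse_bp},
                              unpack=True)

valid = (bps > 0) & (bpd > 0)    # True only where both readings are present
male = gender == b'M'
m_bps = bps[valid & male].mean()
m_bpd = bpd[valid & male].mean()
print('Male average BP: {:.1f}/{:.1f} mmHg'.format(m_bps, m_bpd))
```

Because a missing reading appears as -/- in the file, both fields receive the -99 sentinel together, but testing both keeps the mask correct if only one value were ever missing.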