London Underground usage statistics

Question P9.2.4

Transport for London (TfL) is the UK local government body responsible for the public transport system of Greater London; they make available an Excel document, which provides statistics about the usage of the underground network (the Tube) in the form of entry and exit passenger numbers for a "typical" day at each station over the years 2007–2017.

Read in this document with pandas, and analyze it to determine:

(a) the busiest station on a typical weekday in 2017;

(b) the station with the greatest percentage increase in passengers over the period 2007–2017;

(c) the station with the largest relative difference in passenger numbers between the working week and a typical Sunday in 2017.

Solution P9.2.4

One approach, yielding the following report on the provided data at the time of writing:

Station with maximum entries, 2017: King's Cross St. Pancras (149150 daily)
Station with the greatest percentage increase in passengers: Chesham (0.431532 million in 2007 vs. 1.133596 million in 2017, +162.7%
Station with the largest relative difference in passenger numbers between the working week and a typical Sunday in 2017:
Moorgate (46698 on a weekday, 3961 on Sunday)

The code:

import sys
import pandas as pd

df_dict = pd.read_excel('multi-year-station-entry-and-exit-figures.xlsx',
                        sheet_name=None, skiprows=6)
# Remove DataFrames created from Sheets which are not relevant to us.
del df_dict['Notes']
del df_dict['workings']

# A dictionary of DataFrames, keyed by year.
df_dict = dict((int(k[:4]), df.drop('Borough', axis=1, errors='ignore')
                .rename({'Weekday': 'entry-weekday', 'Saturday': 'entry-sat',
                         'Sunday': 'entry-sun', 'Weekday.1': 'exit-weekday',
                         'Saturday.1': 'exit-sat', 'Sunday.1': 'exit-sun',
                         'million': 'annual'
                        }, axis=1)
               ) for k, df in df_dict.items())

# -- (a) --
# Just the data from 2017, excluding non-station rows.
df2017 = df_dict[2017][df_dict[2017]['nlc'].notna()]
max_weekday_entries_2017 = df2017.loc[df2017['entry-weekday'].idxmax()]
station, entries = max_weekday_entries_2017.loc[['Station', 'entry-weekday']]
print('Station with maximum entries, 2017: {} ({} daily)'.format(
            station, int(entries)))

# -- (b) --
# Set the index to be the station name, extract data from 2007 as well.
df2017.index = df2017['Station'].str.strip()
df2007 = df_dict[2007][df_dict[2007]['nlc'].notna()]
df2007.index = df2007['Station'].str.strip()
# Get Series for the annual numbers on these years (in millions).
annual_2007 = df2007['annual'].rename('Annual 2007')
annual_2017 = df2017['annual'].rename('Annual 2017')
# Form a DataFrame, calculate the percentage increase, extract its maximum.
df = pd.concat([annual_2007, annual_2017], axis=1)
df['% increase'] = ((df['Annual 2017'] - df['Annual 2007'])
                        / df['Annual 2007'] * 100)
station = df['% increase'].idxmax()
print('Station with the greatest percentage increase in passengers:'
      ' {} ({} million in 2007 vs. {} million in 2017, {:+.1f}%'
      .format(station, *df.loc[station]))

# -- (c) --
# Determine the row with the greatest relative weekday passenger numbers
idx =((df2017['entry-weekday'] - df2017['entry-sun'])
                / df2017['entry-weekday']).idxmax()
row = df2017.loc[idx]
print('Station with the largest relative difference in passenger numbers'
      ' between the working week and a typical Sunday in 2017:\n{}'
      ' ({} on a weekday, {} on Sunday)'.format(row['Station'],
        int(row['entry-weekday']), int(row['entry-sun'])))