London Underground usage statistics

Learning Scientific Programming with Python (2nd edition)

P9.2.4: London Underground usage statistics

Question P9.2.4

Transport for London (TfL) is the UK local government body responsible for the public transport system of Greater London; they make available an Excel document, which provides statistics about the usage of the underground network (the Tube) in the form of entry and exit passenger numbers for a "typical" day at each station over the years 2007–2017.

Read in this document with pandas, and analyze it to determine:

(a) the busiest station on a typical weekday in 2017;

(b) the station with the greatest percentage increase in passengers over the period 2007–2017;

(c) the station with the largest relative difference in passenger numbers between the working week and a typical Sunday in 2017.

Solution P9.2.4

One approach, yielding the following report on the provided data at the time of writing:

Station with maximum entries, 2017: King's Cross St. Pancras (149150 daily)
Station with the greatest percentage increase in passengers:
Chesham (0.431532 million in 2007 vs. 1.133596 million in 2017, +162.7%
Station with the largest relative difference in passenger numbers
between the working week and a typical Sunday in 2017:
Moorgate (46698 on a weekday, 3961 on Sunday)

The code:

import sys
import pandas as pd

df_dict = pd.read_excel(
    "multi-year-station-entry-and-exit-figures.xlsx",
    sheet_name=None,
    skiprows=6,
)
# Remove DataFrames created from Sheets which are not relevant to us.
del df_dict["Notes"]
del df_dict["workings"]

# A dictionary of DataFrames, keyed by year.
df_dict = dict(
    (
        int(k[:4]),
        df.drop("Borough", axis=1, errors="ignore").rename(
            {
                "Weekday": "entry-weekday",
                "Saturday": "entry-sat",
                "Sunday": "entry-sun",
                "Weekday.1": "exit-weekday",
                "Saturday.1": "exit-sat",
                "Sunday.1": "exit-sun",
                "million": "annual",
            },
            axis=1,
        ),
    )
    for k, df in df_dict.items()
)

# -- (a) --
# Just the data from 2017, excluding non-station rows.
df2017 = df_dict[2017][df_dict[2017]["nlc"].notna()]
max_weekday_entries_2017 = df2017.loc[df2017["entry-weekday"].idxmax()]
station, entries = max_weekday_entries_2017.loc[["Station", "entry-weekday"]]
print(f"Station with maximum entries, 2017: {station} ({int(entries)} daily)")

# -- (b) --
# Set the index to be the station name, extract data from 2007 as well.
df2017.index = df2017["Station"].str.strip()
df2007 = df_dict[2007][df_dict[2007]["nlc"].notna()]
df2007.index = df2007["Station"].str.strip()
# Get Series for the annual numbers on these years (in millions).
annual_2007 = df2007["annual"].rename("Annual 2007")
annual_2017 = df2017["annual"].rename("Annual 2017")
# Form a DataFrame, calculate the percentage increase, extract its maximum.
df = pd.concat([annual_2007, annual_2017], axis=1)
df["% increase"] = (
    (df["Annual 2017"] - df["Annual 2007"]) / df["Annual 2007"] * 100
)
station = df["% increase"].idxmax()
print(
    "Station with the greatest percentage increase in passengers:\n"
    "{} ({} million in 2007 vs. {} million in 2017, {:+.1f}%".format(
        station, *df.loc[station]
    )
)

# -- (c) --
# Determine the row with the greatest relative weekday passenger numbers
idx = (
    (df2017["entry-weekday"] - df2017["entry-sun"]) / df2017["entry-weekday"]
).idxmax()
row = df2017.loc[idx]
weekday_entry = int(row["entry-weekday"])
sunday_entry = int(row["entry-sun"])
print(
    f"Station with the largest relative difference in passenger numbers\n"
    f"between the working week and a typical Sunday in 2017:\n"
    f"{row['Station']} ({weekday_entry} on a weekday, {sunday_entry}"
    f" on Sunday)"
)