Read in this document with pandas, and analyze it to determine:
(a) the busiest station on a typical weekday in 2017;
(b) the station with the greatest percentage increase in passengers over the period 2007–2017;
(c) the station with the largest relative difference in passenger numbers between the working week and a typical Sunday in 2017.
Solution P9.2.4
One approach, yielding the following report on the provided data at the time of writing:
Station with maximum entries, 2017: King's Cross St. Pancras (149150 daily)
Station with the greatest percentage increase in passengers:
Chesham (0.431532 million in 2007 vs. 1.133596 million in 2017, +162.7%
Station with the largest relative difference in passenger numbers
between the working week and a typical Sunday in 2017:
Moorgate (46698 on a weekday, 3961 on Sunday)
The code:
import sys
import pandas as pd
df_dict = pd.read_excel(
"multi-year-station-entry-and-exit-figures.xlsx",
sheet_name=None,
skiprows=6,
)
# Remove DataFrames created from Sheets which are not relevant to us.
del df_dict["Notes"]
del df_dict["workings"]
# A dictionary of DataFrames, keyed by year.
df_dict = dict(
(
int(k[:4]),
df.drop("Borough", axis=1, errors="ignore").rename(
{
"Weekday": "entry-weekday",
"Saturday": "entry-sat",
"Sunday": "entry-sun",
"Weekday.1": "exit-weekday",
"Saturday.1": "exit-sat",
"Sunday.1": "exit-sun",
"million": "annual",
},
axis=1,
),
)
for k, df in df_dict.items()
)
# -- (a) --
# Just the data from 2017, excluding non-station rows.
df2017 = df_dict[2017][df_dict[2017]["nlc"].notna()]
max_weekday_entries_2017 = df2017.loc[df2017["entry-weekday"].idxmax()]
station, entries = max_weekday_entries_2017.loc[["Station", "entry-weekday"]]
print(f"Station with maximum entries, 2017: {station} ({int(entries)} daily)")
# -- (b) --
# Set the index to be the station name, extract data from 2007 as well.
df2017.index = df2017["Station"].str.strip()
df2007 = df_dict[2007][df_dict[2007]["nlc"].notna()]
df2007.index = df2007["Station"].str.strip()
# Get Series for the annual numbers on these years (in millions).
annual_2007 = df2007["annual"].rename("Annual 2007")
annual_2017 = df2017["annual"].rename("Annual 2017")
# Form a DataFrame, calculate the percentage increase, extract its maximum.
df = pd.concat([annual_2007, annual_2017], axis=1)
df["% increase"] = (
(df["Annual 2017"] - df["Annual 2007"]) / df["Annual 2007"] * 100
)
station = df["% increase"].idxmax()
print(
"Station with the greatest percentage increase in passengers:\n"
"{} ({} million in 2007 vs. {} million in 2017, {:+.1f}%".format(
station, *df.loc[station]
)
)
# -- (c) --
# Determine the row with the greatest relative weekday passenger numbers
idx = (
(df2017["entry-weekday"] - df2017["entry-sun"]) / df2017["entry-weekday"]
).idxmax()
row = df2017.loc[idx]
weekday_entry = int(row["entry-weekday"])
sunday_entry = int(row["entry-sun"])
print(
f"Station with the largest relative difference in passenger numbers\n"
f"between the working week and a typical Sunday in 2017:\n"
f"{row['Station']} ({weekday_entry} on a weekday, {sunday_entry}"
f" on Sunday)"
)