Plotting COVID-19 cases

The Centre for Systems Science and Engineering (CSSE) at Johns Hopkins University publishes daily statistics of the number of confirmed cases of COVID-19 by country on its GitHub page. The short script below pulls data from this page and, for a given country, plots a bar chart of the cumulative number of confirmed cases and a line plot of the day-on-day change in that number as a function of time. For example:

COVID-19 confirmed cases in the UK to 26 March 2020

COVID-19 confirmed cases in South Korea to 26 March 2020

Change the value of the variable country to plot the data for a different country, using one of the values in the "Country/Region" column of the CSV file, time_series_covid19_confirmed_global.csv.
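
To see which names are valid, one option is to list the unique values in that column. The sketch below uses a small made-up DataFrame with the same column layout as the CSV file (the rows and numbers are invented for illustration, not real data), and the helper name list_countries is hypothetical:

```python
import pandas as pd

# A tiny stand-in for the Johns Hopkins CSV: same column layout, invented numbers.
sample = pd.DataFrame({
    'Province/State': [None, 'Hubei', 'Beijing', None],
    'Country/Region': ['United Kingdom', 'China', 'China', 'Korea, South'],
    'Lat': [55.4, 31.0, 40.2, 36.0],
    'Long': [-3.4, 112.3, 116.4, 128.0],
    '3/25/20': [10, 50, 5, 20],
    '3/26/20': [12, 51, 5, 21],
})

def list_countries(df):
    """Return the sorted, de-duplicated names in the Country/Region column."""
    return sorted(df['Country/Region'].unique())

countries = list_countries(sample)
print(countries)
```

With the real data, the same function could be called on the DataFrame df that the script below reads from the CSV file.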

import sys
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# If you have saved a local copy of the CSV file as LOCAL_CSV_FILE,
# set READ_FROM_URL to False
READ_FROM_URL = True
LOCAL_CSV_FILE = 'covid-19-cases.csv'

# Start the plot on the day when the number of confirmed cases reaches MIN_CASES.
MIN_CASES = 100

# The country to plot the data for.
country = 'United Kingdom'

# This is the GitHub URL for the Johns Hopkins data in CSV format
data_loc = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
           'csse_covid_19_data/csse_covid_19_time_series'
           '/time_series_covid19_confirmed_global.csv')

# Read in the data to a pandas DataFrame.
if not READ_FROM_URL:
    data_loc = LOCAL_CSV_FILE
df = pd.read_csv(data_loc)

# Group by country and sum over the different states/regions of each country.
grouped = df.groupby('Country/Region')
df2 = grouped.sum()

def make_plot(country):
    """Make the bar plot of case numbers and change in numbers line plot."""

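    # Extract the Series corresponding to the case numbers for country,
    # dropping any non-date columns that survive the groupby sum (which of
    # them do depends on the pandas version).
    c_df = df2.loc[country].drop(['Province/State', 'Lat', 'Long'],
                                 errors='ignore')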
    # Discard any dates with fewer than MIN_CASES confirmed cases.
    c_df = c_df[c_df >= MIN_CASES].astype(int)
    # Convert the index to proper datetime objects.
    c_df.index = pd.to_datetime(c_df.index)
    n = len(c_df)
    if n == 0:
        print('Too few data to plot: minimum number of cases is {}'
                .format(MIN_CASES))
        sys.exit(1)

    fig = plt.figure()

    # Arrange the subplots on a grid: the top plot (case number change) takes
    # one quarter of the height and the bar chart (total confirmed case
    # numbers) takes the remaining three quarters.
    ax2 = plt.subplot2grid((4,1), (0,0))
    ax1 = plt.subplot2grid((4,1), (1,0), rowspan=3)
    ax1.bar(range(n), c_df.values)
    # Force the x-axis ticks to be integers (whole numbers of days) in case
    # Matplotlib chooses some non-integral number of days to label.
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))

    c_df_change = c_df.diff()
    ax2.plot(range(n), c_df_change.values)
    ax2.set_xticks([])

    ax1.set_xlabel('Days since {} cases'.format(MIN_CASES))
    ax1.set_ylabel('Confirmed cases, $N$')
    ax2.set_ylabel(r'$\Delta N$')

    # Add a title reporting the latest number of cases available.
    title = '{}\n{} cases on {}'.format(country, c_df.iloc[-1],
                c_df.index[-1].strftime('%d %B %Y'))
    plt.suptitle(title)

make_plot(country)
plt.show()
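
The MIN_CASES filter in make_plot relies on the case numbers being cumulative: once the threshold is reached the series stays at or above it, so only the leading entries are discarded. A minimal sketch of the filtering and differencing steps on an invented cumulative series (the dates and numbers are made up for illustration):

```python
import pandas as pd

MIN_CASES = 100

# Invented cumulative case counts, indexed by date strings as in the CSV header.
cases = pd.Series([0, 10, 90, 120, 250, 400],
                  index=['3/1/20', '3/2/20', '3/3/20',
                         '3/4/20', '3/5/20', '3/6/20'])

# Keep only the days on which the cumulative total has reached MIN_CASES.
kept = cases[cases >= MIN_CASES].astype(int)
kept.index = pd.to_datetime(kept.index, format='%m/%d/%y')

# Day-on-day change; the first entry is NaN because it has no predecessor.
change = kept.diff()

print(kept)
print(change)
```

If the series held daily counts instead, the same filter could drop days from the middle of the record, so it is only appropriate for cumulative totals.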

Comments

Sean Rommel 4 years ago

This is an excellent routine. I'm making a few edits I'd be happy to share with you. Specifically, I'm recasting the plots as semilogy to show the beginning of roll-over/flattening. I'm hoping to also build a routine that will extract/plot USA data by state/county. I'd be happy to collaborate as my job permits. Still learning Python, but can help here if you are interested.
Sincerely,
Sean Rommel (Rochester, NY USA)

christian 4 years ago

I would be very interested to see your improvements: the code in this post was updated in the one following it, and lives on GitHub at https://github.com/xnx/covid-19 – feel free to fork and send me a PR.
Cheers,
Christian

Joseph Levine 3 years, 9 months ago

Hello,

I think there is an error in how you remove min cases.

If a cases count time series was
[0, 0, 10, 100, 99, 200]
we would get a graph that said 200 cases occurred on day 2 since 100 cases.

In a more extreme example, China had 0 cases for several weeks. All that time will just be dropped using this method. If I come up with something better I'll PR

Cheers,
Joseph

christian 3 years, 9 months ago

Hi Joseph,
Please do correct me if I've got this wrong, but I think the data being read in are cumulative numbers not daily numbers, so only the first MIN_CASES are dropped from the data set.
Cheers, Christian

Joseph Levine 3 years, 9 months ago

Hi Christian,

You are correct! I converted to daily cases and mixed myself up. Thanks for the prompt reply and the slick solution.

Joseph
