Analysing the history of volcanic eruptions with pandas

The file volcanic-eruptions.csv contains data concerning 822 significant volcanic events on Earth between 1750 BCE and 2020 CE from the US National Centres for Environmental Information (NCEI). The information on each event is given in comma-separated fields and includes date, volcano name, location, type, estimated number of human deaths and "Volcanic Explosivity Index" (VEI).

The data are readily parsed into a DataFrame with:

In [x]: df = pd.read_csv('volcanic-eruptions.csv', index_col=0)

The most deadly volcanic eruption in the database is that of Ilopango, around the middle of the fifth century CE:

In [x]: df.loc[df['Deaths'].idxmax()]
Out[x]: 
Year                 450
Month                NaN
Day                  NaN
Name            Ilopango
Location     El Salvador
Country      El Salvador
Latitude          13.672
Longitude        -89.053
Elevation            450
Type             Caldera
VEI                    6
Deaths             30000
Name: 25, dtype: object

It would be helpful to have a column with the day, month and year of the explosion parsed into a string. Define a helper function, get_date:

def get_date(year, month, day):
    if year < 0:
       s_year = f'{-year} BCE'
    else: 
       s_year = str(year)
    if pd.isnull(month):
       return s_year
    s_date = f'{int(month)}/{s_year}'
    if pd.isnull(day):
       return s_date
    return f'{int(day)}/{s_date}'

and apply it to the DataFrame:

In [x]: df['date'] = [get_date(year, month, day) for year, month, day in
                                    zip(df['Year'], df['Month'], df['Day'])]

Simple filtering can give us a list of eruptions with a VEI of at least 6 since the start of the nineteenth century:

In [x]: df[(df['VEI'] >= 6) & (df['Year'] >= 1800)]                              
Out[x]: 
     Year  Month   Day         Name  ...           Type  VEI   Deaths        date
218  1815    4.0  10.0      Tambora  ...  Stratovolcano  7.0  11000.0   10/4/1815
322  1883    8.0  27.0     Krakatau  ...        Caldera  6.0   2000.0   27/8/1883
365  1902   10.0  25.0  Santa Maria  ...  Stratovolcano  6.0   2500.0  25/10/1902
386  1912    9.0   6.0    Novarupta  ...        Caldera  6.0      2.0    6/9/1912
650  1991    6.0  15.0     Pinatubo  ...  Stratovolcano  6.0    350.0   15/6/1991

To find the 10 most explosive eruptions, we could filter out those with unknown VEI values before sorting:

In [x]: df[pd.notnull(df['VEI'])].sort_values('VEI').tail(10)[
   ...:              ['date', 'Name', 'Type', 'Country', 'VEI']]
Out[x]: 
          date            Name            Type           Country  VEI
29         653        Dakataua         Caldera  Papua New Guinea  6.0
25         450        Ilopango         Caldera       El Salvador  6.0
22         240         Ksudach   Stratovolcano            Russia  6.0
21         230           Taupo         Caldera       New Zealand  6.0
18          60  Bona-Churchill   Stratovolcano     United States  6.0
99   19/2/1600    Huaynaputina   Stratovolcano              Peru  6.0
1     1750 BCE      Veniaminof   Stratovolcano     United States  6.0
40        1000    Changbaishan   Stratovolcano       North Korea  7.0
218  10/4/1815         Tambora   Stratovolcano         Indonesia  7.0
3     1610 BCE       Santorini  Shield volcano            Greece  7.0

However, there are many entries with a VEI of 6 and their ordering here is not clear. A better approach might be to sort first by VEI and next by deaths, setting na_position='first' to ensure that the null values are placed before numerical values (and therefore effectively rank lowest):

In [x]: df.sort_values(['VEI', 'Deaths'], na_position='first').tail(10)[
   ...:             ['date', 'Name', 'Type', 'Country', 'VEI', 'Deaths']]
Out[x]:
           date          Name             Type           Country  VEI   Deaths
386    6/9/1912     Novarupta          Caldera     United States  6.0      2.0
650   15/6/1991      Pinatubo    Stratovolcano       Philippines  6.0    350.0
99    19/2/1600  Huaynaputina    Stratovolcano              Peru  6.0   1500.0
120        1660   Long Island  Complex volcano  Papua New Guinea  6.0   2000.0
322   27/8/1883      Krakatau          Caldera         Indonesia  6.0   2000.0
365  25/10/1902   Santa Maria    Stratovolcano         Guatemala  6.0   2500.0
25          450      Ilopango          Caldera       El Salvador  6.0  30000.0
3      1610 BCE     Santorini   Shield volcano            Greece  7.0      NaN
40         1000  Changbaishan    Stratovolcano       North Korea  7.0      NaN
218   10/4/1815       Tambora    Stratovolcano         Indonesia  7.0  11000.0

We can also plot some histograms summarizing the data:

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(3.2, 2.46))
df['Day'].hist(bins=31, ax=axes[0][0], grid=False)
axes[0][0].set_xlabel('(a) Day')
df['Month'].hist(bins=np.arange(1, 14) - 0.5, ax=axes[0][1], grid=False)
axes[0][1].set_xticks(range(1, 13))
axes[0][1].set_xlabel('(b) Month')
df[df['Year']>1600]['Year'].hist(ax=axes[1][0], grid=False)
axes[1][0].set_xlabel('(c) Year')
df['Elevation'].hist(ax=axes[1][1], grid=False)
axes[1][1].set_xlabel('(d) Elevation /m')
plt.tight_layout()
plt.show()

Histograms summarizing some of the columns of volcanic event data

Histograms summarizing some of the columns of volcanic event data: (a) day of month; (b) month of year; (c) frequency by year since 1600 – hopefully, volcanic events have been progressively better recorded since 1600 and have not actually increased in frequency; (d) volcano elevation.