The file volcanic-eruptions.csv
contains data concerning 822 significant volcanic events on Earth between 1750 BCE and 2020 CE from the US National Centres for Environmental Information (NCEI). The information on each event is given in comma-separated fields and includes date, volcano name, location, type, estimated number of human deaths and "Volcanic Explosivity Index" (VEI).
The data are readily parsed into a DataFrame
with:
In [x]: df = pd.read_csv('volcanic-eruptions.csv', index_col=0)
The most deadly volcanic eruption in the database is that of Ilopango, around the middle of the fifth century CE:
In [x]: df.loc[df['Deaths'].idxmax()]
Out[x]:
Year 450
Month NaN
Day NaN
Name Ilopango
Location El Salvador
Country El Salvador
Latitude 13.672
Longitude -89.053
Elevation 450
Type Caldera
VEI 6
Deaths 30000
Name: 25, dtype: object
It would be helpful to have a column with the day, month and year of the explosion parsed into a string. Define a helper function, get_date
:
def get_date(year, month, day):
if year < 0:
s_year = f'{-year} BCE'
else:
s_year = str(year)
if pd.isnull(month):
return s_year
s_date = f'{int(month)}/{s_year}'
if pd.isnull(day):
return s_date
return f'{int(day)}/{s_date}'
and apply it to the DataFrame
:
In [x]: df['date'] = [get_date(year, month, day) for year, month, day in
zip(df['Year'], df['Month'], df['Day'])]
Simple filtering can give us a list of eruptions with a VEI of at least 6 since the start of the nineteenth century:
In [x]: df[(df['VEI'] >= 6) & (df['Year'] >= 1800)]
Out[x]:
Year Month Day Name ... Type VEI Deaths date
218 1815 4.0 10.0 Tambora ... Stratovolcano 7.0 11000.0 10/4/1815
322 1883 8.0 27.0 Krakatau ... Caldera 6.0 2000.0 27/8/1883
365 1902 10.0 25.0 Santa Maria ... Stratovolcano 6.0 2500.0 25/10/1902
386 1912 9.0 6.0 Novarupta ... Caldera 6.0 2.0 6/9/1912
650 1991 6.0 15.0 Pinatubo ... Stratovolcano 6.0 350.0 15/6/1991
To find the 10 most explosive eruptions, we could filter out those with unknown VEI values before sorting:
In [x]: df[pd.notnull(df['VEI'])].sort_values('VEI').tail(10)[
...: ['date', 'Name', 'Type', 'Country', 'VEI']]
Out[x]:
date Name Type Country VEI
29 653 Dakataua Caldera Papua New Guinea 6.0
25 450 Ilopango Caldera El Salvador 6.0
22 240 Ksudach Stratovolcano Russia 6.0
21 230 Taupo Caldera New Zealand 6.0
18 60 Bona-Churchill Stratovolcano United States 6.0
99 19/2/1600 Huaynaputina Stratovolcano Peru 6.0
1 1750 BCE Veniaminof Stratovolcano United States 6.0
40 1000 Changbaishan Stratovolcano North Korea 7.0
218 10/4/1815 Tambora Stratovolcano Indonesia 7.0
3 1610 BCE Santorini Shield volcano Greece 7.0
However, there are many entries with a VEI of 6 and their ordering here is not clear. A better approach might be to sort first by VEI and next by deaths, setting na_position='first'
to ensure that the null values are placed before numerical values (and therefore effectively rank lowest):
In [x]: df.sort_values(['VEI', 'Deaths'], na_position='first').tail(10)[
...: ['date', 'Name', 'Type', 'Country', 'VEI', 'Deaths']]
Out[x]:
date Name Type Country VEI Deaths
386 6/9/1912 Novarupta Caldera United States 6.0 2.0
650 15/6/1991 Pinatubo Stratovolcano Philippines 6.0 350.0
99 19/2/1600 Huaynaputina Stratovolcano Peru 6.0 1500.0
120 1660 Long Island Complex volcano Papua New Guinea 6.0 2000.0
322 27/8/1883 Krakatau Caldera Indonesia 6.0 2000.0
365 25/10/1902 Santa Maria Stratovolcano Guatemala 6.0 2500.0
25 450 Ilopango Caldera El Salvador 6.0 30000.0
3 1610 BCE Santorini Shield volcano Greece 7.0 NaN
40 1000 Changbaishan Stratovolcano North Korea 7.0 NaN
218 10/4/1815 Tambora Stratovolcano Indonesia 7.0 11000.0
We can also plot some histograms summarizing the data:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(3.2, 2.46))
df['Day'].hist(bins=31, ax=axes[0][0], grid=False)
axes[0][0].set_xlabel('(a) Day')
df['Month'].hist(bins=np.arange(1, 14) - 0.5, ax=axes[0][1], grid=False)
axes[0][1].set_xticks(range(1, 13))
axes[0][1].set_xlabel('(b) Month')
df[df['Year']>1600]['Year'].hist(ax=axes[1][0], grid=False)
axes[1][0].set_xlabel('(c) Year')
df['Elevation'].hist(ax=axes[1][1], grid=False)
axes[1][1].set_xlabel('(d) Elevation /m')
plt.tight_layout()
plt.show()
Histograms summarizing some of the columns of volcanic event data: (a) day of month; (b) month of year; (c) frequency by year since 1600 – hopefully, volcanic events have been progressively better recorded since 1600 and have not actually increased in frequency; (d) volcano elevation.