Scraping a Wikipedia table with Pandas


In my previous post I gave a short script for scraping a particular Wikipedia page for some string-based data in one table. Then the internet had some advice for me: use pandas.read_html, they said. It will be easy, they said; everything will be handled for you, they said. Just clean, analyse and report.

Here's my first pass. pd.read_html returns a list of tables, interpreted as Pandas DataFrames, so I need to find the one with the correct headings:

import pandas as pd

tables = pd.read_html('ISO_3166-1_alpha-2.html', header=0)

headings = ['Code', 'Country name', 'Year', 'ccTLD']
for table in tables:
    current_headings = table.columns.values[:4]
    if current_headings == headings:
        break

This doesn't work:

  File "get_iso_codes-1.py", line 8, in <module>
    if all(current_headings == headings):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Fair enough. Let's try that suggestion:

for table in tables:
    current_headings = table.columns.values[:4]
    if all(current_headings == headings):
        break

Nope:

TypeError: 'bool' object is not iterable

When current_headings does not have the same number of elements as headings, NumPy gives up on the element-by-element comparison: the == expression evaluates to the single scalar False rather than to an array we can pass to all.
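A minimal sketch of the underlying behaviour, using NumPy directly (the short array here stands in for the headings of a table with too few columns):

import numpy as np

headings = ['Code', 'Country name', 'Year', 'ccTLD']
print(np.array(headings) == headings)      # [ True  True  True  True]
print(np.array(headings[:2]) == headings)  # False: a single bool, not an array
                                           # (NumPy may also emit a warning)

So we have to either cast back to a list: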

for table in tables:
    current_headings = table.columns.values[:4]
    if list(current_headings) == headings:
        break

or check the length first:

for table in tables:
    current_headings = table.columns.values[:4]
    if len(current_headings) != len(headings):
        continue
    if all(current_headings == headings):
        break

Either way, now that we have a DataFrame holding the data from the correct table, we need to tidy up the fields containing accented characters. Pandas has two replace methods that can work with strings. After much head-scratching, I think the summary is that Series.str.replace accepts a callable for the replacement, but doesn't act in place:

ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
patt = r'.*!(.*)'
repl = lambda m: m.group(1)
ser = ser.str.replace(patt, repl, regex=True)

produces the desired:

0    Åland Islands
1          Réunion
2         Zimbabwe

whilst Series.replace doesn't accept a callable but will work in place:

ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
to_replace = r'(.*)!'
ser.replace(to_replace, '', regex=True, inplace=True)

Alternatively, pass a dictionary:

ser.replace({r'.*!(.*)': r'\1'}, regex=True, inplace=True)

We need to take care... various plausible alternatives don't work:

ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
to_replace = r'!(.*)'
ser.replace(to_replace, r'\1', regex=True, inplace=True)
print(ser)

just removes the exclamation mark:

0    Aland Islands Åland Islands
1                Reunion Réunion
2                       Zimbabwe

(we would need to_replace = r'.*!(.*)'.) Whereas replacing the raw string r'\1' with the plain string '\1':

ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
to_replace = r'!(.*)'
ser.replace(to_replace, '\1', regex=True, inplace=True)
print(ser)

does something even more unexpected:

0    Aland Islands 
1          Reunion 
2           Zimbabwe

(in a non-raw string, '\1' is interpreted as the single control character U+0001, not as a backreference; the raw string r'\1' is needed to interpolate the first matched group).
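A quick check at the interpreter makes the difference plain:

print(repr('\1'))   # '\x01': one unprintable control character
print(repr(r'\1'))  # '\\1': a backslash followed by the digit 1

Anyway, now we can write the DataFrame as a semicolon-separated file: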

table[headings].to_csv('test.txt', sep='; ', header=False, index=False)

No we can't. We're only allowed to use a single character as the field separator. Very well:

table[headings].to_csv('test.txt', sep=';', header=False, index=False)

Done? Not quite. One line in the output differs from that produced by the method given in the previous post:

;Namibia;1974;.na

should be

NA;Namibia;1974;.na

Pandas has silently treated the code for Namibia as a missing data value and output an empty string. Back up at the top, we should have used keep_default_na=False:

tables = pd.read_html('ISO_3166-1_alpha-2.html', header=0, keep_default_na=False)
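The same silent conversion is easy to reproduce with read_csv, which shares the keep_default_na parameter (a minimal sketch; the two-row CSV here is made up for illustration):

import pandas as pd
from io import StringIO

csv = 'Code,Country name\nNA,Namibia\nGB,United Kingdom'
print(pd.read_csv(StringIO(csv)))                          # 'NA' becomes NaN
print(pd.read_csv(StringIO(csv), keep_default_na=False))   # 'NA' survives as a string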

Now we're done, but the final code is hardly any shorter than doing all this explicitly with Beautiful Soup. Compare:

import pandas as pd

tables = pd.read_html('ISO_3166-1_alpha-2.html', header=0,
                      keep_default_na=False)

headings = ['Code', 'Country name', 'Year', 'ccTLD']
for table in tables:
    current_headings = table.columns.values[:4]
    if len(current_headings) != len(headings):
        continue
    if all(current_headings == headings):
        break

table['Country name'].replace({r'.*!(.*)': r'\1'}, regex=True, inplace=True)
table[headings].to_csv('test.txt', sep=';', header=False, index=False)

with:

from bs4 import BeautifulSoup

article = open('ISO_3166-1_alpha-2.html').read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:5] == ['Code', 'Country name', 'Year', 'ccTLD', 'ISO 3166-2']:
        break

with open('iso_3166-1_alpha-2_codes.txt', 'w') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        code, country, year, ccTLD = [td.text.strip() for td in tds[:4]]
        if '!' in country:
            country = country[country.index('!')+1:]
        print('; '.join([code, country, year, ccTLD]), file=fo)