Scraping a Wikipedia table with Beautiful Soup

The Beautiful Soup Python library is an excellent way to scrape web pages for their content. I recently wanted a reasonably accurate list of official (ISO 3166-1) two-letter codes for countries, but didn't want to pay CHF 38 for the official ISO document. The Wikipedia article on ISO 3166-1 alpha-2 contains this information in an HTML table, which can be scraped quite easily as follows.

First, get a local copy of the Wikipedia article:

import urllib.request

url = 'https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2'
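# Close the connection when done and decode the bytes explicitly as UTF-8.
with urllib.request.urlopen(url) as response:
    article = response.read().decode('utf-8')

# Write UTF-8 explicitly so the result does not depend on the platform default.
with open('ISO_3166-1_alpha-2.html', 'w', encoding='utf-8') as fo:
    fo.write(article)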
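
Some servers reject requests that carry urllib's default User-Agent. If the download above fails with an HTTP 403 error, sending a descriptive header may help. The following is a minimal sketch under that assumption; the helper names build_request and fetch and the USER_AGENT string are hypothetical and not part of the original script.

```python
import urllib.request

URL = 'https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2'
# Hypothetical identifier: replace it with something describing your own script.
USER_AGENT = 'iso-codes-example/0.1 (personal script)'


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url):
    """Download url and return its body decoded as UTF-8."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read().decode('utf-8')
```

Calling fetch(URL) would then return the article text, which can be written to 'ISO_3166-1_alpha-2.html' exactly as before.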

Then load it and parse it with Beautiful Soup. Extract the sortable <table> elements and search for the one whose headings match the data we want. Finally, iterate over its rows, pulling out the columns we want and writing the cell text to the file 'iso_3166-1_alpha-2_codes.txt'. The file should be interpreted as UTF-8 encoded; your browser may or may not realise this.

from bs4 import BeautifulSoup

# Load article, turn into soup and get the <table>s.
with open('ISO_3166-1_alpha-2.html', encoding='utf-8') as fi:
    article = fi.read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

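# Search through the tables for the one with the headings we want.
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:5] == ['Code', 'Country name', 'Year', 'ccTLD', 'ISO 3166-2']:
        break
else:
    # No break occurred: fail loudly rather than silently use the last table.
    raise ValueError('No table with the expected headings was found.')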

# Extract the columns we want and write to a semicolon-delimited text file.
with open('iso_3166-1_alpha-2_codes.txt', 'w', encoding='utf-8') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        code, country, year, ccTLD = [td.text.strip() for td in tds[:4]]
        # Wikipedia does something funny with country names containing
        # accented characters: extract the correct string form.
        if '!' in country:
            country = country[country.index('!')+1:]
        print('; '.join([code, country, year, ccTLD]), file=fo)
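
To try the table-matching and row-extraction logic without downloading anything, the same steps can be run against a small inline HTML string. This is only a sketch: SAMPLE_HTML is made-up markup that imitates the layout described above (including a sort key before the '!'), not the real article, and extract_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia page; not the real article markup.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Code</th><th>Country name</th><th>Year</th><th>ccTLD</th></tr>
  <tr><td>AX</td><td>Aland Islands!Åland Islands</td><td>2004</td><td>.ax</td></tr>
  <tr><td>NA</td><td>Namibia</td><td>1974</td><td>.na</td></tr>
</table>
"""


def extract_rows(html, wanted):
    """Return the cell text of the first sortable table whose headings start with wanted."""
    soup = BeautifulSoup(html, 'html.parser')
    for table in soup.find_all('table', class_='sortable'):
        headings = [th.text.strip() for th in table.find_all('th')]
        if headings[:len(wanted)] == wanted:
            break
    else:
        raise ValueError('No table with the expected headings was found.')

    rows = []
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue  # header row: it contains <th> cells only
        cells = [td.text.strip() for td in tds[:len(wanted)]]
        # Drop everything up to and including a '!' in the country name.
        if '!' in cells[1]:
            cells[1] = cells[1][cells[1].index('!') + 1:]
        rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML, ['Code', 'Country name', 'Year', 'ccTLD'])
for row in rows:
    print('; '.join(row))
```

Passing class_='sortable' matches the table even though its class attribute holds two values, because Beautiful Soup compares against each class individually.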

Comments

Radish 10 months, 3 weeks ago

Thanks - very useful! I'm something of a beginner with Python and have adapted this to what I'm working on. How would you go about putting this into a pandas dataframe?

christian 10 months, 3 weeks ago

Hello, Radish,
There is a follow-up post https://scipython.com/blog/scraping-a-wikipedia-table-with-pandas/ that deals with using Pandas.
Christian

Sabika 10 months, 3 weeks ago

Hi Christian, I have the same requirement: reading the table contents into a pandas dataframe. The solution you have provided reads the table contents using a dataframe but does not store them in one. If you have something similar, please help.

Hi Radish, if you have found such a solution, please let me know.

Thanks in advance!

christian 10 months, 3 weeks ago

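Do you mean you wish to read the semicolon-delimited text file, iso_3166-1_alpha-2_codes.txt, into a Pandas dataframe? Does this work?

import pandas as pd

# Fields are separated by '; ', so skip the space that follows each semicolon.
# keep_default_na=False stops Namibia's code, NA, being read as a missing value.
df = pd.read_csv('iso_3166-1_alpha-2_codes.txt', sep=';', skipinitialspace=True,
                 keep_default_na=False, header=None, encoding='utf-8',
                 names=['code', 'country', 'year', 'ccTLD'])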
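
Two details of this file are easy to trip over when loading it with Pandas: the fields are separated by '; ' (semicolon plus space), and Namibia's code is the string NA, which read_csv treats as a missing value by default. The sketch below shows the difference on two made-up lines held in memory; the variable names naive and careful are only illustrative.

```python
import io

import pandas as pd

# Two made-up lines in the same '; '-delimited layout as the output file.
sample = "AX; Åland Islands; 2004; .ax\nNA; Namibia; 1974; .na\n"
names = ['code', 'country', 'year', 'ccTLD']

# Default settings: 'NA' becomes NaN and text fields keep a leading space.
naive = pd.read_csv(io.StringIO(sample), sep=';', header=None, names=names)

# Skip the space after each delimiter and keep 'NA' as an ordinary string.
careful = pd.read_csv(io.StringIO(sample), sep=';', header=None, names=names,
                      skipinitialspace=True, keep_default_na=False)

print(naive)
print(careful)
```

The same two keyword arguments apply unchanged when reading from the file name instead of a StringIO object.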
