The Beautiful Soup Python library is an excellent way to scrape web pages for their content. I recently wanted a reasonably accurate list of official (ISO 3166-1) two-letter codes for countries, but didn't want to pay CHF 38 for the official ISO document. The Wikipedia article on ISO 3166-1 alpha-2 contains this information in an HTML table, which can be scraped quite easily as follows.
First, get a local copy of the Wikipedia article:
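    import urllib.request

    url = 'https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2'
    # Download the page and decode the bytes as UTF-8 text.
    with urllib.request.urlopen(url) as req:
        article = req.read().decode('utf-8')

    # Save a local copy, stating the encoding explicitly.
    with open('ISO_3166-1_alpha-2.html', 'w', encoding='utf-8') as fo:
        fo.write(article)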
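Keeping a local copy means the page only needs to be downloaded once while the parsing code is being developed. A minimal sketch of that idea follows; the helper name fetch_once is hypothetical (it is not part of urllib or Beautiful Soup), and it assumes the page is UTF-8 encoded.

```python
import urllib.request
from pathlib import Path


def fetch_once(url, path):
    """Return the page text, downloading it only if `path` does not exist yet.

    Hypothetical helper for illustration; assumes the page is UTF-8 encoded.
    """
    local = Path(path)
    if not local.exists():
        # No local copy yet: download the page and save it.
        with urllib.request.urlopen(url) as response:
            local.write_text(response.read().decode('utf-8'), encoding='utf-8')
    # Always read back from the local copy.
    return local.read_text(encoding='utf-8')


# Example call (performs a network request on the first run only):
# article = fetch_once('https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2',
#                      'ISO_3166-1_alpha-2.html')
```

Repeated runs then read from disk rather than requesting the article again.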
Then load it and parse it with Beautiful Soup. Extract all the <table> tags and search for the one whose headings correspond to the data we want. Finally, iterate over its rows, pulling out the columns we want and writing the cell text to the file 'iso_3166-1_alpha-2_codes.txt'. The file should be interpreted as UTF-8 encoded – your browser may or may not realise this.
    from bs4 import BeautifulSoup

    # Load article, turn into soup and get the <table>s.
    with open('ISO_3166-1_alpha-2.html', encoding='utf-8') as fi:
        article = fi.read()
    soup = BeautifulSoup(article, 'html.parser')
    tables = soup.find_all('table', class_='sortable')

    # Search through the tables for the one with the headings we want.
    for table in tables:
        ths = table.find_all('th')
        headings = [th.text.strip() for th in ths]
        if headings[:5] == ['Code', 'Country name', 'Year', 'ccTLD', 'ISO 3166-2']:
            break
    else:
        # No table matched: stop rather than silently using the wrong one.
        raise ValueError('Could not find the table of country codes.')

    # Extract the columns we want and write to a semicolon-delimited text file.
    with open('iso_3166-1_alpha-2_codes.txt', 'w', encoding='utf-8') as fo:
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            if not tds:
                # Header rows contain only <th> cells: skip them.
                continue
            code, country, year, ccTLD = [td.text.strip() for td in tds[:4]]
            # Wikipedia does something funny with country names containing
            # accented characters: extract the correct string form.
            if '!' in country:
                country = country[country.index('!')+1:]
            print('; '.join([code, country, year, ccTLD]), file=fo)
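To use the resulting file, each line can be split back into its four fields. The following is a minimal sketch, assuming the 'code; country; year; ccTLD' layout written above; the function name parse_codes and the two sample lines are for illustration only.

```python
def parse_codes(lines):
    """Map each two-letter code to its country name.

    Hypothetical helper; assumes each line is 'code; country; year; ccTLD'.
    """
    codes = {}
    for line in lines:
        # Drop only the newline, so an empty final field is preserved.
        line = line.rstrip('\n')
        if not line.strip():
            continue
        parts = line.split('; ')
        if len(parts) < 4:
            continue  # Not a complete record: skip it.
        # The code is the first field and the last two are year and ccTLD;
        # whatever lies between is the country name, even if it contains '; '.
        codes[parts[0]] = '; '.join(parts[1:-2])
    return codes


# Sample lines in the format of the output file.
sample = ['AD; Andorra; 1974; .ad\n', 'AE; United Arab Emirates; 1974; .ae\n']
print(parse_codes(sample))
```

With the real file, the same function can be given the open file object directly, for example parse_codes(open('iso_3166-1_alpha-2_codes.txt', encoding='utf-8')).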