The Beautiful Soup Python library is an excellent way to scrape web pages for their content. I recently wanted a reasonably accurate list of official (ISO 3166-1) two-letter codes for countries, but didn't want to pay CHF 38 for the official ISO document. The Wikipedia article ISO 3166-1 alpha-2 contains this information in an HTML table which can be scraped quite easily as follows.
First, get a local copy of the Wikipedia article:
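import urllib.request

url = 'https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2'
req = urllib.request.urlopen(url)
article = req.read().decode()
# Write the page as UTF-8 explicitly so the result does not depend on
# the platform's default encoding.
with open('ISO_3166-1_alpha-2.html', 'w', encoding='utf-8') as fo:
    fo.write(article)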
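If the download is refused, or the page turns out not to be UTF-8, a slightly more defensive fetch may help. The sketch below assumes the server prefers a descriptive User-Agent header; the USER_AGENT string and the helper names build_request and fetch are hypothetical and not part of the code above.

```python
import urllib.request

URL = 'https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2'
# Hypothetical identifying string: replace it with your own details.
USER_AGENT = 'iso-codes-example/0.1 (contact: you@example.com)'


def build_request(url=URL, user_agent=USER_AGENT):
    """Return a Request carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url=URL, timeout=30):
    """Download url and return its body decoded as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)
```

Calling fetch() returns the article text, which can then be written to 'ISO_3166-1_alpha-2.html' exactly as before.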
Then load it and parse it with Beautiful Soup. Extract all the <table> tags and search for the one with the headings corresponding to the data we want. Finally, iterate over its rows, pulling out the columns we want and writing the cell text to the file 'iso_3166-1_alpha-2_codes.txt'. The file should be interpreted as UTF-8 encoded; your browser may or may not realise this.
from bs4 import BeautifulSoup

# Load article, turn into soup and get the <table>s.
with open('ISO_3166-1_alpha-2.html', encoding='utf-8') as fi:
    article = fi.read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

# Search through the tables for the one with the headings we want.
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:5] == ['Code', 'Country name', 'Year', 'ccTLD', 'ISO 3166-2']:
        break

# Extract the columns we want and write to a semicolon-delimited text file.
with open('iso_3166-1_alpha-2_codes.txt', 'w', encoding='utf-8') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        code, country, year, ccTLD = [td.text.strip() for td in tds[:4]]
        # Wikipedia does something funny with country names containing
        # accented characters: extract the correct string form.
        if '!' in country:
            country = country[country.index('!')+1:]
        print('; '.join([code, country, year, ccTLD]), file=fo)
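One weakness of the search loop is that, if no table has the expected headings, the loop simply ends with table bound to the last table on the page. A minimal sketch of a stricter variant follows, run on a small made-up HTML snippet so that it works offline. The snippet, including the hidden sort-key span ending in '!', is an assumption about the page markup made for illustration; it is not copied from Wikipedia.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two sortable tables, only the second of
# which has the headings we are after.
html = """
<table class="sortable">
  <tr><th>Code</th><th>Notes</th></tr>
  <tr><td>ZZ</td><td>user-assigned</td></tr>
</table>
<table class="sortable">
  <tr><th>Code</th><th>Country name</th><th>Year</th><th>ccTLD</th><th>ISO 3166-2</th></tr>
  <tr><td>AD</td><td>Andorra</td><td>1974</td><td>.ad</td><td>ISO 3166-2:AD</td></tr>
  <tr><td>AX</td><td><span style="display:none">Aland Islands !</span>Åland Islands</td><td>2004</td><td>.ax</td><td>ISO 3166-2:AX</td></tr>
</table>
"""

wanted = ['Code', 'Country name', 'Year', 'ccTLD', 'ISO 3166-2']
soup = BeautifulSoup(html, 'html.parser')

for table in soup.find_all('table', class_='sortable'):
    headings = [th.text.strip() for th in table.find_all('th')]
    if headings[:5] == wanted:
        break
else:
    # The loop finished without a break: no table matched.
    raise ValueError('no table with the expected headings was found')

rows = []
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) < 4:
        continue  # header rows have no <td> cells
    code, country, year, cctld = [td.text.strip() for td in tds[:4]]
    # Keep only the text after a hidden sort key, if one is present.
    country = country.split('!', 1)[-1].strip()
    rows.append((code, country, year, cctld))

for row in rows:
    print('; '.join(row))
```

The for/else form raises an error instead of silently scraping the wrong table, which tends to be easier to debug when the page layout changes.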
Comments
Radish 5 years, 10 months ago
Thanks - very useful! I'm something of a beginner with Python and have adapted this to what I'm working on. How would you go about putting this into a pandas dataframe?
christian 5 years, 10 months ago
Hello, Radish,
There is a follow-up post https://scipython.com/blog/scraping-a-wikipedia-table-with-pandas/ that deals with using Pandas.
Christian
Sabika 5 years, 10 months ago
Hi Christian, I have the same requirement of reading the table contents into a pandas dataframe. The solution you have provided reads the table contents using a dataframe but does not store them in a dataframe. If you have something similar, please help.
Hi Radish, if you have found any such solution, kindly let me know.
Thanks in advance !
rafael 4 years, 8 months ago
I know that it is too late for you, but for others who might have the same problem, use this after the continue statement:
data = Postcode, Borough, Neighbourhood = [td.text.strip() for td in tds[:3]]
ls.append(data)
And outside the for loop:
df = pd.DataFrame(ls, columns=cols)
df
christian 5 years, 10 months ago
Do you mean you wish to read the semicolon-delimited text file, iso_3166-1_alpha-2_codes.txt, into a Pandas dataframe? Does this work:
import pandas as pd
df = pd.read_csv('iso_3166-1_alpha-2_codes.txt', sep=';', header=None, names=['code', 'country', 'year', 'ccTLD'])
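Two details may matter when reading this file with pandas, assuming it was written as in the post with '; ' between fields. First, each field after the first starts with a space unless skipinitialspace=True is passed. Second, pandas treats the string 'NA' as a missing value by default, which would turn the code for Namibia into NaN unless keep_default_na=False is set. A minimal sketch, using a three-line sample typed in by hand in place of the real file:

```python
import io

import pandas as pd

# Hand-typed sample in the same '; '-delimited layout as the output file.
sample = (
    "AD; Andorra; 1974; .ad\n"
    "NA; Namibia; 1974; .na\n"
    "AX; Åland Islands; 2004; .ax\n"
)

df = pd.read_csv(
    io.StringIO(sample),       # pass the filename here for the real file
    sep=';',
    header=None,
    names=['code', 'country', 'year', 'ccTLD'],
    skipinitialspace=True,     # drop the space that follows each ';'
    keep_default_na=False,     # keep 'NA' (Namibia) as a string
)
print(df)
```

For the real file, replacing io.StringIO(sample) with 'iso_3166-1_alpha-2_codes.txt' and adding encoding='utf-8' should give the full table.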