In my previous post I gave a short script for scraping a particular Wikipedia page for some string-based data in one table. Then the internet had some advice for me.
Use pandas.read_html
they said. It will be easy, they said; everything will be handled for you, they said. Just clean, analyse and report.
Here's my first pass. pd.read_html
returns a list of tables, interpreted as Pandas DataFrames, so I need to find the one with the correct headings:
import pandas as pd
tables = pd.read_html('ISO_3166-1_alpha-2.html', header=0)
headings = ['Code', 'Country name', 'Year', 'ccTLD']
for table in tables:
    current_headings = table.columns.values[:4]
    if current_headings == headings:
        break
This doesn't work:
  File "get_iso_codes-1.py", line 8, in <module>
    if current_headings == headings:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Fair enough. Let's try that suggestion:
for table in tables:
    current_headings = table.columns.values[:4]
    if all(current_headings == headings):
        break
Nope:
TypeError: 'bool' object is not iterable
When current_headings does not have the same number of elements as headings, the comparison just yields a scalar boolean rather than comparing element-by-element and producing an iterable we can pass to all. We have to either cast back to a list:
for table in tables:
    current_headings = table.columns.values[:4]
    if list(current_headings) == headings:
        break
or check the length first:
for table in tables:
    current_headings = table.columns.values[:4]
    if len(current_headings) != len(headings):
        continue
    if all(current_headings == headings):
        break
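As an aside, NumPy's array_equal sidesteps both problems, since it treats a length mismatch as simple inequality rather than an error — a minimal sketch:

```python
import numpy as np

headings = ['Code', 'Country name', 'Year', 'ccTLD']

# Arrays of different lengths just compare unequal -- no ValueError,
# no TypeError, and no separate length check needed.
print(np.array_equal(np.array(['Code', 'Country name']), headings))  # False

# Matching contents compare equal.
print(np.array_equal(np.array(headings), headings))  # True
```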
Either way, we now have a DataFrame holding the data from the correct table, and we need to tidy up the fields with accented characters. Pandas has two replace methods that can work with strings. After much head-scratching, I think the summary is that Series.str.replace accepts a callable for the replacement, but doesn't act in place:
ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
patt = r'.*!(.*)'
repl = lambda m: m.group(1)
ser = ser.str.replace(patt, repl, regex=True)
produces the desired:
0 Åland Islands
1 Réunion
2 Zimbabwe
whilst Series.replace
doesn't accept a callable but will work in place:
ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
to_replace = r'(.*)!'
ser.replace(to_replace, '', regex=True, inplace=True)
Alternatively, pass a dictionary:
ser.replace({r'.*!(.*)': r'\1'}, regex=True, inplace=True)
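Both pandas flavours lean on the standard re module underneath, and the callable form of Series.str.replace follows the same protocol as re.sub — a plain-Python sketch of the same clean-up:

```python
import re

names = ['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe']

# re.sub passes each match object to the function, the same
# protocol Series.str.replace uses for a callable replacement.
cleaned = [re.sub(r'.*!(.*)', lambda m: m.group(1), s) for s in names]
print(cleaned)  # ['Åland Islands', 'Réunion', 'Zimbabwe']
```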
We need to take care... various plausible alternatives don't work:
ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
to_replace = r'!(.*)'
ser.replace(to_replace, r'\1', regex=True, inplace=True)
print(ser)
just removes the exclamation mark:
0 Aland Islands Åland Islands
1 Reunion Réunion
2 Zimbabwe
(we would need to_replace = r'.*!(.*)'). Whereas replacing r'\1' with the plain string '\1':
ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
to_replace = r'!(.*)'
ser.replace(to_replace, '\1', regex=True, inplace=True)
print(ser)
does something even more unexpected:
0 Aland Islands
1 Reunion
2 Zimbabwe
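The tails aren't actually gone: a quick check shows they have been replaced by an invisible control character, because without the raw-string prefix Python interprets '\1' before the regex engine ever sees it.

```python
import re

# Without the raw-string prefix, '\1' is the single control
# character U+0001, not a two-character regex backreference.
print(len(r'\1'), len('\1'))  # 2 1
print(repr('\1'))             # '\x01'

# So the "replacement" inserts that invisible character literally.
print(repr(re.sub(r'!(.*)', '\1', 'Reunion !Réunion')))  # 'Reunion \x01'
```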
(the raw string is needed for '\1' to reach the regex engine intact and interpolate the first matched substring this way). Anyway, now we can write the DataFrame as a semicolon-separated file:
table[headings].to_csv('test.txt', sep='; ', header=False, index=False)
No we can't. We're only allowed to use a single character as the field separator. Very well:
table[headings].to_csv('test.txt', sep=';', header=False, index=False)
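If the two-character separator '; ' were really wanted, one workaround (a sketch using a toy DataFrame rather than the scraped one) is to join the fields ourselves and skip to_csv entirely:

```python
import pandas as pd

df = pd.DataFrame({'Code': ['GB', 'FR'],
                   'Country name': ['United Kingdom', 'France']})

# to_csv insists on a single-character sep, so build the
# '; '-separated lines by hand instead.
lines = df.astype(str).apply('; '.join, axis=1)
print('\n'.join(lines))
```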
Done? Not quite. One line in the output is different from the method given in the previous post:
;Namibia;1974;.na
should be
NA;Namibia;1974;.na
Pandas has silently treated the code for Namibia as a missing data value and output an empty string. Back up at the top, we should have used keep_default_na=False
:
tables = pd.read_html('ISO_3166-1_alpha-2.html', header=0, keep_default_na=False)
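keep_default_na is shared across the pandas readers, so a small read_csv demonstration (with made-up data) shows the same behaviour:

```python
import pandas as pd
from io import StringIO

data = 'Code\nNA\nGB'

# By default the string "NA" is parsed as missing data...
df = pd.read_csv(StringIO(data))
print(df['Code'].isna().tolist())  # [True, False]

# ...but keep_default_na=False preserves it as a literal string.
df = pd.read_csv(StringIO(data), keep_default_na=False)
print(df['Code'].tolist())  # ['NA', 'GB']
```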
Now we're done, but the final code is hardly any shorter than doing all this explicitly with Beautiful Soup. Compare:
import pandas as pd

tables = pd.read_html('ISO_3166-1_alpha-2.html', header=0,
                      keep_default_na=False)
headings = ['Code', 'Country name', 'Year', 'ccTLD']
for table in tables:
    current_headings = table.columns.values[:4]
    if len(current_headings) != len(headings):
        continue
    if all(current_headings == headings):
        break
table['Country name'].replace({r'.*!(.*)': r'\1'}, regex=True, inplace=True)
table[headings].to_csv('test.txt', sep=';', header=False, index=False)
with:
from bs4 import BeautifulSoup

article = open('ISO_3166-1_alpha-2.html').read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:5] == ['Code', 'Country name', 'Year', 'ccTLD', 'ISO 3166-2']:
        break

with open('iso_3166-1_alpha-2_codes.txt', 'w') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        code, country, year, ccTLD = [td.text.strip() for td in tds[:4]]
        if '!' in country:
            country = country[country.index('!')+1:]
        print('; '.join([code, country, year, ccTLD]), file=fo)