A simple webscraping script with pandas

Learning Scientific Programming with Python (2nd edition)

E9.9: A simple webscraping script with pandas

At the time of writing, the first table on the Wikipedia page https://en.wikipedia.org/wiki/List_of_wine-producing_regions contains columns of the rank, country name and wine production for the principal wine-producing countries in the world. To parse it with pandas:

In [x]: import pandas as pd
In [x]: dfs = pd.read_html(
              'https://en.wikipedia.org/wiki/List_of_wine-producing_regions',
              index_col=1, match="Wine production by country")
In [x]: dfs[0].head()

The output is:

Out[x]:
                                    Rank  Production(tonnes)
Country(with link to wine article)                          
Italy                                  1             4796900
France                                 2             4607850
Spain                                  3             4293466
United States                          4             3300000
China                                  5             1700000

In this case, the table is identified by matching the text inside the <caption> element of the first <table> on the page.

dfs is a list containing a single item, the DataFrame parsed from the matching table.
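
Because the live Wikipedia page may change, the same behaviour can be tried offline. The following is a minimal sketch, not part of the original example: the HTML string is a hypothetical stand-in for a web page, with its first three rows copied from the output above and a second, unrelated table added for contrast. It assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for a web page: two captioned tables.
html = """
<table>
  <caption>Wine production by country</caption>
  <thead>
    <tr><th>Rank</th><th>Country</th><th>Production (tonnes)</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Italy</td><td>4796900</td></tr>
    <tr><td>2</td><td>France</td><td>4607850</td></tr>
    <tr><td>3</td><td>Spain</td><td>4293466</td></tr>
  </tbody>
</table>
<table>
  <caption>An unrelated table</caption>
  <thead><tr><th>A</th><th>B</th></tr></thead>
  <tbody><tr><td>1</td><td>2</td></tr></tbody>
</table>
"""

# read_html returns a list of DataFrames, one for each table whose text
# matches the `match` string (interpreted as a regular expression).
dfs = pd.read_html(io.StringIO(html), index_col=1,
                   match="Wine production by country")

# Only the first table matches, so the list holds a single DataFrame,
# indexed by the column at position 1 (the country name).
df = dfs[0]
print(len(dfs))
print(df.loc["Italy", "Production (tonnes)"])
```

Wrapping the string in io.StringIO lets read_html treat it as a file-like object rather than a URL or path. Without the match argument, both tables would be parsed and dfs would contain two DataFrames.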