Reading HTML tables#

Pandas can parse the HTML content of a given page and extract the tables it finds: pd.read_html returns them as a list of DataFrames, one per table. More on that in data gathering.
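To make the return value concrete, here is a minimal sketch that runs pd.read_html on a small inline HTML string instead of a live page. The HTML snippet is made up for illustration (its values are copied from the tables shown later on this page), and the example assumes an HTML parser backend such as lxml, or bs4 with html5lib, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page; the values are taken from the outputs shown on this page.
html = """
<table>
  <tr><th>Technology</th><th>Median</th></tr>
  <tr><td>Coal</td><td>820</td></tr>
  <tr><td>Gas</td><td>490</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Share(%)</th></tr>
  <tr><td>China</td><td>30.34</td></tr>
  <tr><td>India</td><td>6.83</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it can parse.
# Wrapping the string in StringIO avoids the deprecation warning that newer
# pandas versions raise for literal HTML strings.
tables = pd.read_html(StringIO(html), header=0)
print(len(tables))

# The match argument keeps only tables that contain text matching the pattern.
country_tables = pd.read_html(StringIO(html), match="Country", header=0)
print(country_tables[0])
```

Passing match is one way to avoid scanning a long list of tables by hand when a distinctive word from the target table is known in advance.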

Importing libraries and packages#

# Mathematical operations and data manipulation
import pandas as pd

Loading dataset#

Greenhouse gas emissions by country#

url = "https://en.wikipedia.org/wiki/Greenhouse_gas_emissions"
list_of_df = pd.read_html(url, header=0)
dataset_15 = list_of_df[0]
dataset_15.head()
   Technology                                     Min.                                           Median                                         Max.
0  Currently commercially available technologies  Currently commercially available technologies  Currently commercially available technologies  Currently commercially available technologies
1  Coal – PC                                      740                                            820                                            910
2  Gas – combined cycle                           410                                            490                                            650
3  Biomass – Dedicated                            130                                            230                                            420
4  Solar PV – Utility scale                       18                                             48                                             180

Wrangling greenhouse tables#

# Checking the length of the list returned
len(list_of_df)
22
# Looking for a particular table
for table in list_of_df:
    print(table.shape)
(14, 4)
(22, 3)
(16, 4)
(10, 5)
(212, 5)
(0, 2)
(7, 1)
(38, 2)
(1, 2)
(3, 2)
(1, 2)
(1, 2)
(5, 2)
(3, 2)
(4, 2)
(2, 2)
(5, 2)
(3, 2)
(4, 2)
(2, 2)
(1, 2)
(1, 2)

The fifth element in the list, with shape (212, 5), looks like the “2019 Fossil CO2 emissions by country” table.

# Extract the fifth element from the list
dataset_16 = list_of_df[4]
dataset_16.head()
         Country  total emissions(Mton)  Share(%)  per capita(ton)  per GDP(ton/k$)
0   Global Total               38016.57    100.00             4.93             0.29
1          China               11535.20     30.34             8.12             0.51
2  United States                5107.26     13.43            15.52             0.25
3        EU27+UK                3303.97      8.69             6.47             0.14
4          India                2597.36      6.83             1.90             0.28
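Picking a table by position, as done above with index 4, depends on the page layout staying the same. As a hedged alternative, the sketch below selects tables by the column names they contain. The helper find_tables is hypothetical (it is not part of pandas), and the small DataFrames stand in for the list that pd.read_html would return, so the example runs without network access.

```python
import pandas as pd

# Stand-ins for the list returned by pd.read_html (values copied from the outputs above).
list_of_df = [
    pd.DataFrame({"Technology": ["Coal – PC"], "Median": [820]}),
    pd.DataFrame({"Country": ["China", "India"], "Share(%)": [30.34, 6.83]}),
]


def find_tables(tables, required_columns):
    """Return (position, table) pairs whose columns include every required name.

    Hypothetical helper for illustration; not a pandas function.
    """
    required = set(required_columns)
    return [
        (position, table)
        for position, table in enumerate(tables)
        if required.issubset(map(str, table.columns))
    ]


matches = find_tables(list_of_df, ["Country", "Share(%)"])
for position, table in matches:
    # Print the position in the list alongside the shape to confirm the pick.
    print(position, table.shape)
```

On the real page, the same call with the column names seen in dataset_16 should single out the emissions-by-country table, provided those column names are unique to it.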