Reading HTML tables#

Pandas includes an HTML parser, pd.read_html, that scans the HTML of a given page and returns every table it can extract as a list of DataFrames. See data gathering for more on that.
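To get a feel for the return value before hitting a live page, here is a minimal sketch that parses a small inline HTML string. The markup and its values are a made-up stand-in, not a real page, and the example assumes a parser backend (lxml, or BeautifulSoup with html5lib) is installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, inlined so no network access is needed
html = """
<table>
  <thead><tr><th>Technology</th><th>Median</th></tr></thead>
  <tbody>
    <tr><td>Coal</td><td>820</td></tr>
    <tr><td>Gas</td><td>490</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Country</th><th>Share</th></tr></thead>
  <tbody>
    <tr><td>China</td><td>30.34</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> element it can parse
tables = pd.read_html(StringIO(html))

# match= keeps only the tables whose text matches the given string or regex
only_country = pd.read_html(StringIO(html), match="Country")

print(len(tables), len(only_country))
```

Wrapping the markup in StringIO avoids passing literal HTML as a string, which recent pandas versions deprecate.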

Importing libraries and packages#

# Mathematical operations and data manipulation
import pandas as pd

Loading dataset#

Greenhouse gas emissions by country#

url = "https://en.wikipedia.org/wiki/Greenhouse_gas_emissions"
list_of_df = pd.read_html(url, header=0)
dataset_15 = list_of_df[0]
dataset_15.head()

	Technology	Min.	Median	Max.
0	Currently commercially available technologies	Currently commercially available technologies	Currently commercially available technologies	Currently commercially available technologies
1	Coal – PC	740	820	910
2	Gas – combined cycle	410	490	650
3	Biomass – Dedicated	130	230	420
4	Solar PV – Utility scale	18	48	180
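Row 0 above is a sub-heading that spans the whole table, so its text is repeated in every column and the numeric columns end up stored as text. One possible cleanup, sketched on a small stand-in built from the rows shown above (the names df, is_spanning, and clean are illustrative and not part of the original notebook):

```python
import pandas as pd

# Small stand-in for dataset_15, using values from the output above
label = "Currently commercially available technologies"
df = pd.DataFrame({
    "Technology": [label, "Coal – PC", "Gas – combined cycle"],
    "Min.": [label, "740", "410"],
    "Median": [label, "820", "490"],
    "Max.": [label, "910", "650"],
})

# A spanning sub-heading repeats a single value across every column of its row
is_spanning = df.nunique(axis=1) == 1
clean = df[~is_spanning].reset_index(drop=True)

# With the text rows gone, the measurement columns can be converted to numbers
value_cols = ["Min.", "Median", "Max."]
clean[value_cols] = clean[value_cols].apply(pd.to_numeric, errors="coerce")

print(clean)
```

This heuristic would also drop a genuine data row whose cells all happen to be equal, so it is worth checking which rows is_spanning flags before discarding them.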

Wrangling greenhouse tables#

# Checking the length of the list returned
len(list_of_df)

# Looking for a particular table
for table in list_of_df:
    print(table.shape)

(14, 4)
(22, 3)
(16, 4)
(10, 5)
(212, 5)
(0, 2)
(7, 1)
(38, 2)
(1, 2)
(3, 2)
(1, 2)
(1, 2)
(5, 2)
(3, 2)
(4, 2)
(2, 2)
(5, 2)
(3, 2)
(4, 2)
(2, 2)
(1, 2)
(1, 2)

The fifth element in the list (index 4), with shape (212, 5), looks like the “2019 Fossil CO2 emissions by country” table.
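Picking a table by its position is fragile, because the order of tables can change whenever the page is edited. A hedged alternative is to search the list by column labels. The helper find_tables below is hypothetical (it is not a pandas function), and the two empty DataFrames only stand in for list_of_df.

```python
import pandas as pd


def find_tables(tables, required):
    """Return (index, table) pairs whose column labels contain every string in `required`."""
    hits = []
    for i, table in enumerate(tables):
        # str() also copes with tuple labels from multi-level headers
        labels = [str(col) for col in table.columns]
        if all(any(word in label for label in labels) for word in required):
            hits.append((i, table))
    return hits


# Stand-ins for list_of_df: only the column labels matter here
tables = [
    pd.DataFrame(columns=["Technology", "Min.", "Median", "Max."]),
    pd.DataFrame(columns=["Country", "total emissions(Mton)", "Share(%)"]),
]

hits = find_tables(tables, ["Country", "Share"])
print([i for i, _ in hits])
```

On the real list, a call such as find_tables(list_of_df, ["Country", "per capita"]) should narrow the candidates to the emissions table, assuming the column names still match those shown in this section.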

# Extract the fifth element (index 4) from the list
dataset_16 = list_of_df[4]
dataset_16.head()

	Country	total emissions(Mton)	Share(%)	per capita(ton)	per GDP(ton/k$)
0	Global Total	38016.57	100.00	4.93	0.29
1	China	11535.20	30.34	8.12	0.51
2	United States	5107.26	13.43	15.52	0.25
3	EU27+UK	3303.97	8.69	6.47	0.14
4	India	2597.36	6.83	1.90	0.28
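The table mixes aggregate rows such as "Global Total" with individual countries, which skews any ranking or sum. A minimal sketch of separating the two, rebuilt from the five rows shown above; which rows count as aggregates is an assumption here and should be checked against the full table.

```python
import pandas as pd

# Stand-in for dataset_16, using the rows printed above
emissions = pd.DataFrame({
    "Country": ["Global Total", "China", "United States", "EU27+UK", "India"],
    "total emissions(Mton)": [38016.57, 11535.20, 5107.26, 3303.97, 2597.36],
    "per capita(ton)": [4.93, 8.12, 15.52, 6.47, 1.90],
})

# Assumed list of aggregate rows; extend it after inspecting the full table
aggregates = ["Global Total", "EU27+UK"]
countries = emissions[~emissions["Country"].isin(aggregates)]

# Rank the remaining countries by per-capita emissions, highest first
ranked = countries.sort_values("per capita(ton)", ascending=False).reset_index(drop=True)
print(ranked)
```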