Reading HTML tables#
pandas can read HTML tables through pd.read_html, which parses the HTML of a given page (using a parser library such as lxml) and returns every table it finds as a list of DataFrames, one per table. More on that in data gathering.
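To make the "list of tables" behaviour concrete, here is a minimal sketch that runs on an inline HTML string instead of a live page. The sample markup, the SAMPLE_HTML constant, and the read_tables helper are hypothetical names introduced for illustration; only pd.read_html and its match and header parameters come from pandas, and an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, inlined so the sketch needs no network access
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Technology</th><th>Median</th></tr>
  <tr><td>Coal</td><td>820</td></tr>
  <tr><td>Gas</td><td>490</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Share</th></tr>
  <tr><td>China</td><td>30.34</td></tr>
</table>
</body></html>
"""


def read_tables(html: str, match: str = ".+") -> list:
    """Return every table in `html` whose text matches the regex `match`."""
    # StringIO wraps the literal HTML; header=0 uses the first row as column names.
    # Assumption: an HTML parser such as lxml is available to pandas.
    return pd.read_html(StringIO(html), match=match, header=0)
```

Under these assumptions, read_tables(SAMPLE_HTML) returns a list of two DataFrames, while read_tables(SAMPLE_HTML, match="Country") keeps only the second one. When no table matches, pd.read_html raises a ValueError rather than returning an empty list.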
Importing libraries and packages#
# Mathematical operations and data manipulation
import pandas as pd
Loading dataset#
Greenhouse gas emissions by country#
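url = "https://en.wikipedia.org/wiki/Greenhouse_gas_emissions"
# read_html returns a list with one DataFrame per table found on the page;
# header=0 uses the first row of each table as the column names
list_of_df = pd.read_html(url, header=0)
# Take the first table on the page
dataset_15 = list_of_df[0]
dataset_15.head()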
| | Technology | Min. | Median | Max. |
|---|---|---|---|---|
| 0 | Currently commercially available technologies | Currently commercially available technologies | Currently commercially available technologies | Currently commercially available technologies |
| 1 | Coal – PC | 740 | 820 | 910 |
| 2 | Gas – combined cycle | 410 | 490 | 650 |
| 3 | Biomass – Dedicated | 130 | 230 | 420 |
| 4 | Solar PV – Utility scale | 18 | 48 | 180 |
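Row 0 in the output above is a sub-header that spanned the whole table on the page, so the same text is repeated in every column. A minimal sketch of one way to drop such rows and convert the remaining values to numbers follows. The sample frame is a hand-built stand-in for dataset_15.head(), the values are assumed to arrive as strings because of the mixed row, and drop_spanning_rows is a hypothetical helper name.

```python
import pandas as pd

SPAN = "Currently commercially available technologies"

# Stand-in for dataset_15.head(), rebuilt by hand from the output above.
# Assumption: numeric cells arrive as strings because row 0 holds text.
sample = pd.DataFrame({
    "Technology": [SPAN, "Coal – PC", "Gas – combined cycle",
                   "Biomass – Dedicated", "Solar PV – Utility scale"],
    "Min.": [SPAN, "740", "410", "130", "18"],
    "Median": [SPAN, "820", "490", "230", "48"],
    "Max.": [SPAN, "910", "650", "420", "180"],
})


def drop_spanning_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose cells all hold the same value (flattened spanning sub-headers)."""
    is_spanning = df.nunique(axis=1) == 1
    return df.loc[~is_spanning].reset_index(drop=True)


cleaned = drop_spanning_rows(sample)

# Convert the measurement columns to numbers; unparseable cells become NaN
numeric_cols = ["Min.", "Median", "Max."]
cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(cleaned.dtypes)
```

The "all cells identical" test is a heuristic: it would also drop a genuine data row whose values happened to be equal in every column, so it is worth checking which rows are removed before relying on it.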
Wrangling greenhouse tables#
# Checking the length of the list returned
len(list_of_df)
22
# Print the shape of each table to help locate a particular one
for table in list_of_df:
    print(table.shape)
(14, 4)
(22, 3)
(16, 4)
(10, 5)
(212, 5)
(0, 2)
(7, 1)
(38, 2)
(1, 2)
(3, 2)
(1, 2)
(1, 2)
(5, 2)
(3, 2)
(4, 2)
(2, 2)
(5, 2)
(3, 2)
(4, 2)
(2, 2)
(1, 2)
(1, 2)
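Shapes alone rarely identify a table, and positions in the list can shift whenever the page is edited. As a hedged alternative, the sketch below searches a list of DataFrames by column name. The tables list is a small made-up stand-in for list_of_df, and find_tables_with_columns is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Stand-in for list_of_df: three small frames with illustrative contents
tables = [
    pd.DataFrame({"Technology": ["Coal"], "Median": [820]}),
    pd.DataFrame({"Country": ["China"], "Share(%)": [30.34]}),
    pd.DataFrame({0: ["note"], 1: ["text"]}),
]


def find_tables_with_columns(tables, required):
    """Return the positions of tables whose columns include every name in `required`."""
    wanted = set(required)
    return [i for i, table in enumerate(tables) if wanted.issubset(table.columns)]


matches = find_tables_with_columns(tables, ["Country", "Share(%)"])
print(matches)
```

Applied to the real list, the same helper could be called with column names taken from the table of interest, which avoids hard-coding a position.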
The fifth element in the list (index 4, shape (212, 5)) looks like the “2019 Fossil CO2 emissions by country” table.
# Extract the fifth element from the list
dataset_16 = list_of_df[4]
dataset_16.head()
| | Country | total emissions(Mton) | Share(%) | per capita(ton) | per GDP(ton/k$) |
|---|---|---|---|---|---|
| 0 | Global Total | 38016.57 | 100.00 | 4.93 | 0.29 |
| 1 | China | 11535.20 | 30.34 | 8.12 | 0.51 |
| 2 | United States | 5107.26 | 13.43 | 15.52 | 0.25 |
| 3 | EU27+UK | 3303.97 | 8.69 | 6.47 | 0.14 |
| 4 | India | 2597.36 | 6.83 | 1.90 | 0.28 |
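The first rows mix aggregates ("Global Total", "EU27+UK") with individual countries, which would distort any per-country ranking. A minimal sketch of separating them is shown below. The emissions frame is rebuilt by hand from the head() output above, and the AGGREGATE_LABELS set is an assumption: the full 212-row table may contain other aggregate rows not visible here.

```python
import pandas as pd

# Stand-in for dataset_16.head(), rebuilt by hand from the output above
emissions = pd.DataFrame({
    "Country": ["Global Total", "China", "United States", "EU27+UK", "India"],
    "total emissions(Mton)": [38016.57, 11535.20, 5107.26, 3303.97, 2597.36],
    "Share(%)": [100.00, 30.34, 13.43, 8.69, 6.83],
    "per capita(ton)": [4.93, 8.12, 15.52, 6.47, 1.90],
    "per GDP(ton/k$)": [0.29, 0.51, 0.25, 0.14, 0.28],
})

# Assumption: these are the only aggregate labels; extend the set as needed
AGGREGATE_LABELS = {"Global Total", "EU27+UK"}

# Keep only rows that describe a single country
is_aggregate = emissions["Country"].isin(AGGREGATE_LABELS)
countries = emissions.loc[~is_aggregate].reset_index(drop=True)

# Rank the remaining countries by per-capita emissions, highest first
top_per_capita = countries.sort_values("per capita(ton)", ascending=False)
print(top_per_capita[["Country", "per capita(ton)"]])
```

Filtering by an explicit label set keeps the choice visible and easy to review, at the cost of having to maintain the set by hand when the source table changes.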