Reading data from XML

XML (Extensible Markup Language) is a markup language that resembles HTML but is more flexible, because it does not prescribe a fixed set of tags. It is a meta-language: a language used to define other markup languages, such as RSS and MathML. XML is also widely used for exchanging data over the web.

Importing libraries and packages

# Data gathering
import xml.etree.ElementTree as ET

Set paths

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Creating an XML file and reading XML objects

# Define an XML document as a multi-line string
data = """
<person>
  <name>Ano</name>
  <surname>Nymous</surname>
  <phone type="intl">
     +33 9999999999
   </phone>
   <email hide="yes">
   ano.nymous@gmail.com</email>
</person>"""

# Parse the string; fromstring() returns the root element
tree = ET.fromstring(data)
type(tree)
xml.etree.ElementTree.Element
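
The cell above parses a string, so no file is involved yet. As a minimal sketch of the file-based counterpart, the following writes a similar document to a temporary file and parses it with `ET.parse`. The shortened sample document and the file name `person.xml` are assumptions made for illustration, not part of the original dataset.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, shaped like the one above
sample = "<person><name>Ano</name><surname>Nymous</surname></person>"

# fromstring() parses text and returns the root Element directly
root_from_string = ET.fromstring(sample)

# Write the same text to a temporary file, then parse that file
with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "person.xml")  # illustrative file name
    with open(xml_path, "w", encoding="utf-8") as f:
        f.write(sample)
    # parse() returns an ElementTree wrapper; getroot() yields the root Element
    parsed_tree = ET.parse(xml_path)
    root_from_file = parsed_tree.getroot()

print(root_from_string.tag)
print(root_from_file.find("name").text)
```

The practical difference is the return type: `fromstring` gives the root element itself, while `parse` gives a tree object from which the root is obtained with `getroot()`.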

Finding elements within a tree element

# Print the name
print("Name:", tree.find("name").text)
Name: Ano

# Print the surname
print("Surname:", tree.find("surname").text)
Surname: Nymous

# Print the phone number
print("Phone:", tree.find("phone").text.strip())
Phone: +33 9999999999

# Print email status and the actual email
print("Email hidden:", tree.find("email").get("hide"))
print("Email:", tree.find("email").text.strip())
Email hidden: yes
Email: ano.nymous@gmail.com
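
The lookups above assume every element exists. `find` returns `None` when no matching child is present, so chaining `.text` onto it would raise an `AttributeError`. The sketch below shows one way to guard against that, using a shortened, hypothetical version of the person document; the `fax` tag and `type` attribute are deliberately absent to show the fallbacks.

```python
import xml.etree.ElementTree as ET

# Hypothetical document without <surname> or <fax> elements
person = ET.fromstring(
    '<person><name>Ano</name>'
    '<email hide="yes"> ano.nymous@gmail.com </email></person>'
)

# find() returns None when no direct child has the requested tag
fax = person.find("fax")
fax_text = fax.text.strip() if fax is not None else "n/a"

# findtext() returns the given default instead of failing on a missing element
surname = person.findtext("surname", default="unknown")

# get() accepts a default for missing attributes as well
email = person.find("email")
email_hidden = email.get("hide", "no")
email_kind = email.get("type", "unspecified")

print(fax_text, surname, email_hidden, email_kind)
```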

Traversing a tree

# Parse an XML file; parse() returns an ElementTree object
tree2 = ET.parse(f"{data_path}/xml1.xml")
type(tree2)
xml.etree.ElementTree.ElementTree

# Get the root element and loop over its direct children
root = tree2.getroot()
for child in root:
    print("Child:", child.tag, "| Child attribute:", child.attrib)
Child: country | Child attribute: {'name': 'Liechtenstein'}
Child: country | Child attribute: {'name': 'Singapore'}
Child: country | Child attribute: {'name': 'Panama'}
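
Looping over an element visits only its direct children. To reach nested elements as well, `Element.iter` walks the whole subtree in document order. The sketch below uses an inline document rather than `xml1.xml`, since that file's full contents are not shown here. The structure and the values other than Liechtenstein's `gdppc` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the dataset file
countries_xml = """
<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
  </country>
  <country name="Singapore">
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
  </country>
</data>"""
sample_root = ET.fromstring(countries_xml)

# Plain iteration: direct children only
direct_children = [child.tag for child in sample_root]

# iter(): the element itself plus every descendant, in document order
all_tags = [elem.tag for elem in sample_root.iter()]

# iter(tag): only descendants with the given tag
gdp_values = [elem.text for elem in sample_root.iter("gdppc")]

print(direct_children)
print(all_tags)
print(gdp_values)
```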

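Using the text attribute to extract data

# Third child of the first country element
root[0][2]
<Element 'gdppc' at 0x7f4f5c35c1d0>

# Its text content (always a string)
root[0][2].text
'141100'

# Its tag name
root[0][2].tag
'gdppc'

# The first country element itself
root[0]
<Element 'country' at 0x7f4f5c2d3270>

root[0].tag
'country'

# Its attributes, as a dictionary
root[0].attrib
{'name': 'Liechtenstein'}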
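
Positional indexing such as `root[0][2]` depends on the order of child elements, and `.text` always yields a string. A possible alternative, sketched below, looks up children by tag and converts the values to numbers while collecting one record per country. The inline document is a hypothetical stand-in for `xml1.xml`, and the values other than Liechtenstein's `gdppc` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the dataset file
countries_xml = """
<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <gdppc>141100</gdppc>
  </country>
  <country name="Singapore">
    <rank>4</rank>
    <gdppc>59900</gdppc>
  </country>
</data>"""
sample_root = ET.fromstring(countries_xml)

records = []
for country in sample_root.findall("country"):
    records.append(
        {
            # Attribute lookup by name
            "name": country.get("name"),
            # Child lookup by tag, converted from string to int
            "rank": int(country.findtext("rank")),
            "gdppc": int(country.findtext("gdppc")),
        }
    )

for record in records:
    print(record)
```

A list of dictionaries like `records` is convenient to pass on to other tools, for example when building a table from the parsed data.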