XML Processing Modules in Python

PythonServer Side ProgrammingProgramming

XML stands for "Extensible Markup Language". It is mainly used in webpages, where the data has a specific structure. It has elements, defined by a beginning and an ending tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, are the element's content. Elements can contain other elements, which are called "child elements".

Example

Below is the example of an XML file we are going to use in this tutorial.

<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101">
      <author>Vicky, Matthew</author>
      <title>Geo-Spatial Data Analysis</title>
      <stream>Python</stream>
      <price>4.95</price>
      <publish_date>2020-07-01</publish_date>
      <description>Learn geo Spatial data Analysis using Python.</description>
   </Tutorial>
   <Tutorial id="Tu102">
      <author>Bolan, Kim</author>
      <title>Data Structures</title>
      <stream>Computer Science</stream>
      <price>12.03</price>
      <publish_date>2020-1-19</publish_date>
      <description>Learn Data structures using different programming lanuages.</description>
   </Tutorial>
   <Tutorial id="Tu103">
      <author>Sora, Everest</author>
      <title>Analytics using Tensorflow</title>
      <stream>Data Science</stream>
      <price>7.11</price>
      <publish_date>2020-1-19</publish_date>
      <description>Learn Data analytics using Tensorflow.</description>
   </Tutorial>
</Tutorials>

Reading xml Using xml.etree.ElementTree

This module provides access to the root of the xml file and then we can access the contents of the inner elements. In the below example we use the attribute called text and get the content of those elements.

Example

import xml.etree.ElementTree as ET
xml_tree = ET.parse('E:\\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
for xml_elmt in xml_root:
   for inner_elmt in xml_elmt:
      print(inner_elmt.text)

Output

Running the above code gives us the following result −

Tutorial List :
Vicky, Matthew
Geo-Spatial Data Analysis
Python
4.95
2020-07-01
Learn geo Spatial data Analysis using Python.
Bolan, Kim
Data Structures
Computer Science
12.03
2020-1-19
Learn Data structures using different programming lanuages.
Sora, Everest
Analytics using Tensorflow
Data Science
7.11
2020-1-19
Learn Data analytics using Tensorflow.

Getting the xml attributes

We can get the list of attributes and their values in the root tag. Once we find the attributes, it helps us navigate the XML tree easily.

Example

import xml.etree.ElementTree as ET
xml_tree = ET.parse('E:\\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
for movie in xml_root.iter('Tutorial'):
   print(movie.attrib)

Output

Running the above code gives us the following result −

Tutorial List :
{'id': 'Tu101'}
{'id': 'Tu102'}
{'id': 'Tu103'}

Filtering Results

We can also filter the results out of the xml tree by using the findall() function of this module. In the below example we find out the id of the tutorial which has a price of 12.03.

Example

import xml.etree.ElementTree as ET
xml_tree = ET.parse('E:\\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
for movie in xml_root.findall("./Tutorial/[price ='12.03']"):
   print(movie.attrib)

Output

Running the above code gives us the following result −

Tutorial List :
{'id': 'Tu102'}

Parsing XML with DOM APIs

We create a minidom object using the xml.dom module. The minidom object provides a simple parser method that quickly creates a DOM tree from the XML file. The sample phrase calls the parse( file [,parser] ) function of the minidom object to parse the XML file designated by file into a DOM tree object.

Example

from xml.dom.minidom import parse
import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse('E:\\TutorialsList.xml')
collection = DOMTree.documentElement

# Get all the movies in the collection
tut_list = collection.getElementsByTagName("Tutorial")

print("*****Tutorials*****")
# Print details of each Tutorial.
for tut in tut_list:

   strm = tut.getElementsByTagName('stream')[0]
   print("Stream: ",strm.childNodes[0].data)

   prc = tut.getElementsByTagName('price')[0]
   print("Price: ", prc.childNodes[0].data)

Output

Running the above code gives us the following result −

*****Tutorials*****
Stream: Python
Price: 4.95
Stream: Computer Science
Price: 12.03
Stream: Data Science
Price: 7.11
raja
Published on 25-Jan-2021 07:49:13
Advertisements