problem
You want to extract data from a simple XML document.
solution
The xml.etree.ElementTree module can be used to extract data from simple XML documents. To illustrate, suppose you want to parse the RSS feed on Planet Python and print a summary of it. Here is a script that does this:
from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it
u = urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

# Extract and output tags of interest
for item in doc.iterfind('channel/item'):
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')

    print(title)
    print(date)
    print(link)
    print()
Running the code above produces output similar to the following:
Steve Holden: Python for Data Analysis
Mon, 19 Nov 2012 02:13:51 +0000
http://holdenweb.blogspot.com/2012/11/python-for-data-analysis.html

Vasudev Ram: The Python Data model (for v2 and v3)
Sun, 18 Nov 2012 22:06:47 +0000
http://jugad2.blogspot.com/2012/11/the-python-data-model.html

Python Diary: Been playing around with Object Databases
Sun, 18 Nov 2012 20:40:29 +0000
http://www.pythondiary.com/blog/Nov.18,2012/been-…-object-databases.html

Vasudev Ram: Wakari, Scientific Python in the cloud
Sun, 18 Nov 2012 20:19:41 +0000
http://jugad2.blogspot.com/2012/11/wakari-scientific-python-in-cloud.html

Jesse Jiryu Davis: Toro: synchronization primitives for Tornado coroutines
Sun, 18 Nov 2012 20:17:49 +0000
http://feedproxy.google.com/~r/EmptysquarePython/~3/_DOZT2Kd0hQ/
Obviously, if you want to do further processing, replace the print() calls with code that does something more interesting.
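As one possible form of such further processing, the sketch below collects each item into a dictionary instead of printing it. The helper name extract_items and the inline SAMPLE_RSS string are assumptions made for illustration; the sample stands in for the downloaded feed so the code runs without network access.

```python
from io import BytesIO
from xml.etree.ElementTree import parse

# Hypothetical sample feed, trimmed to the fields used in the recipe
SAMPLE_RSS = b"""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Planet Python</title>
    <item>
      <title>Steve Holden: Python for Data Analysis</title>
      <pubDate>Mon, 19 Nov 2012 02:13:51 +0000</pubDate>
      <link>http://holdenweb.blogspot.com/2012/11/python-for-data-analysis.html</link>
    </item>
    <item>
      <title>Vasudev Ram: The Python Data model (for v2 and v3)</title>
      <pubDate>Sun, 18 Nov 2012 22:06:47 +0000</pubDate>
      <link>http://jugad2.blogspot.com/2012/11/the-python-data-model.html</link>
    </item>
  </channel>
</rss>
"""

def extract_items(source):
    """Return one dict per channel/item element in an RSS document."""
    doc = parse(source)
    return [
        {
            'title': item.findtext('title'),
            'date': item.findtext('pubDate'),
            'link': item.findtext('link'),
        }
        for item in doc.iterfind('channel/item')
    ]

# BytesIO plays the role of the file-like object returned by urlopen()
items = extract_items(BytesIO(SAMPLE_RSS))
for entry in items:
    print(entry['date'], '|', entry['title'])
```

Because parse() accepts any file-like object, the same helper should also work when passed the result of urlopen() directly.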
discuss
Working with data encoded as XML is common in many applications. Not only is XML widely used for exchanging data on the Internet, it is also a common format for storing application data (for example, word processing documents and music libraries). The discussion that follows assumes the reader is already familiar with XML basics.
In many cases, when XML is used simply to store data, the document structure is compact and straightforward. For example, the RSS feed from the example looks similar to the following:
<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
  <title>Planet Python</title>
  <link>http://planet.python.org/</link>
  <language>en</language>
  <description>Planet Python - http://planet.python.org/</description>
  <item>
    <title>Steve Holden: Python for Data Analysis</title>
    <guid>http://holdenweb.blogspot.com/...-data-analysis.html</guid>
    <link>http://holdenweb.blogspot.com/...-data-analysis.html</link>
    <description>...</description>
    <pubDate>Mon, 19 Nov 2012 02:13:51 +0000</pubDate>
  </item>
  <item>
    <title>Vasudev Ram: The Python Data model (for v2 and v3)</title>
    <guid>http://jugad2.blogspot.com/...-data-model.html</guid>
    <link>http://jugad2.blogspot.com/...-data-model.html</link>
    <description>...</description>
    <pubDate>Sun, 18 Nov 2012 22:06:47 +0000</pubDate>
  </item>
  <item>
    <title>Python Diary: Been playing around with Object Databases</title>
    <guid>http://www.pythondiary.com/...-object-databases.html</guid>
    <link>http://www.pythondiary.com/...-object-databases.html</link>
    <description>...</description>
    <pubDate>Sun, 18 Nov 2012 20:40:29 +0000</pubDate>
  </item>
  ...
</channel>
</rss>
The xml.etree.ElementTree.parse() function parses the entire XML document into a document object. From there, you can use methods such as find(), iterfind(), and findtext() to search for specific XML elements. The argument to these methods is the name of a specific tag, such as channel/item or title. When specifying tags, you need to take the overall document structure into account. Each search operation takes place relative to a starting element, and the tag name supplied to each operation is likewise a path relative to that starting element. For example, doc.iterfind('channel/item') searches for all item elements under the channel element. Here doc represents the top of the document (the top-level rss element). The later calls to item.findtext() then search relative to each item element that was found.
Each element represented by the ElementTree module has a few essential attributes and methods that are useful when parsing. The tag attribute contains the name of the tag, the text attribute contains the enclosed text, and the get() method can be used to extract attribute values (it returns None if the attribute is not present). For example:
>>> doc
<xml.etree.ElementTree.ElementTree object at 0x101339510>
>>> e = doc.find('channel/title')
>>> e
<Element 'title' at 0x10135b310>
>>> e.tag
'title'
>>> e.text
'Planet Python'
>>> e.get('some_attribute')
>>>
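To make relative searching and the tag, text, and get() members more concrete, here is a minimal sketch that runs against a small inline document. The document content, including the isPermaLink attribute on guid, is a made-up sample for illustration and is not taken from the real feed.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical document; the isPermaLink attribute is included for illustration
root = fromstring("""
<rss version="2.0">
  <channel>
    <title>Planet Python</title>
    <item>
      <title>First post</title>
      <guid isPermaLink="false">post-1</guid>
    </item>
  </channel>
</rss>
""")

# root is the rss element, so search paths are written relative to it
channel_title = root.find('channel/title')
print(channel_title.tag, channel_title.text)

# A bare 'title' only matches direct children of rss, so nothing is found
print(root.find('title'))

# Searches on an item start from that item, so 'title' now matches
item = root.find('channel/item')
print(item.findtext('title'))

# get() reads attribute values; a missing attribute gives None or a supplied default
guid = item.find('guid')
print(guid.get('isPermaLink'))
print(guid.get('missing'))
print(guid.get('missing', 'n/a'))
print(root.get('version'))
```

The second lookup shows why the starting element matters: the same tag name can succeed or fail depending on where the search begins.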
It should be emphasized that xml.etree.ElementTree is not the only option for parsing XML. For more advanced applications, consider lxml. It uses the same programming interface as ElementTree, so the example above works with lxml in the same way. You only need to change the first import to from lxml.etree import parse. lxml provides the benefit of being fully compliant with XML standards and is also very fast. It additionally supports features such as validation, XSLT, and XPath.
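The import swap might look like the sketch below. The try/except fallback to the standard library is an addition for illustration, not part of the recipe, and it assumes lxml may or may not be installed; the inline document is also a made-up sample.

```python
from io import BytesIO

try:
    # Preferred when the third-party lxml package is available
    from lxml.etree import parse
except ImportError:
    # Fall back to the standard library, which offers the same calls used here
    from xml.etree.ElementTree import parse

# Hypothetical one-item feed, used so the example runs without network access
SAMPLE = b'<rss><channel><item><title>Hello</title></item></channel></rss>'

doc = parse(BytesIO(SAMPLE))
titles = [item.findtext('title') for item in doc.iterfind('channel/item')]
print(titles)
```

The rest of the code is unchanged whichever import succeeds, which is the practical meaning of the shared interface.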