mdesjardins.github.io

(Continued from my last post)
So, the first thing I needed to do when creating my weather applet for Avant Window Navigator was parse weather data from a weather source. After messing around with Google's weather API for a while, I decided to use weather.com's web service, which has a well-documented, straightforward, predictable XML API. To parse the XML, I chose minidom, the "Lightweight DOM Implementation" in Python's standard library. Here's how it works: let's say you have an XML document that supplies a pizza menu, at some URL. Here's the XML:
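As a stand-in to experiment with, here is a minimal sketch of what such a menu might look like, parsed from an inline string rather than a URL. Only the pizza and topping element names, the type attribute, and the heart-attack-special toppings come from this post; the pizza-menu root element name and the veggie pizza are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical menu. The root element name and the "veggie" pizza are
# assumptions; the heart-attack-special toppings follow the post.
MENU_XML = """<pizza-menu>
  <pizza type="veggie">
    <topping>mushroom</topping>
    <topping>green pepper</topping>
  </pizza>
  <pizza type="heart-attack-special">
    <topping>pepperoni</topping>
    <topping>sausage</topping>
    <topping>hamburg</topping>
    <topping>canadian bacon</topping>
    <topping>ham</topping>
  </pizza>
</pizza-menu>"""

# parseString builds the same kind of DOM tree as parse, but from a string
xmldoc = minidom.parseString(MENU_XML)
root_name = xmldoc.documentElement.tagName
print("Root element: %s" % root_name)
```

Working from a string like this is handy for trying things out without a server to fetch from.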

In the Python script that will be parsing this, you'd want to import the minidom module. Let's assume that the above XML is served by the URL http://menu.pizzaplace.us, so you'll want to import urllib as well. The Python code to read the XML document might look like the following:

from xml.dom import minidom
import urllib
import sys
try:
  usock = urllib.urlopen("http://menu.pizzaplace.us")
  xmldoc = minidom.parse(usock)
  usock.close()
except:
  print "Something really bad happened! ", sys.exc_info()[0]

Easy, right? Now we want to get the actual data out of the pizza menu. Everything in your DOM tree is a node. This includes the text between element tags. In fact, in minidom, the whitespace between elements is a text node, too (more on that in a minute!). To fetch nodes, you use the getElementsByTagName method. It returns a list of all the elements below the node you call it on whose tag name matches. Another handy method is getAttribute. As you might expect, it returns the value of an attribute on a particular element.
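The point about whitespace is easy to check for yourself. The sketch below parses a small, hypothetical pizza element (the "plain" type and its toppings are made up for illustration) and counts what minidom puts in its childNodes list.

```python
from xml.dom import minidom

# Hypothetical pizza, indented the way hand-written XML usually is
xmldoc = minidom.parseString("""<pizza type="plain">
  <topping>cheese</topping>
  <topping>basil</topping>
</pizza>""")

pizza = xmldoc.documentElement
child_count = len(pizza.childNodes)

# Separate the real elements from the whitespace-only text nodes
element_children = [n for n in pizza.childNodes
                    if n.nodeType == n.ELEMENT_NODE]
whitespace_children = [n for n in pizza.childNodes
                       if n.nodeType == n.TEXT_NODE and n.data.strip() == ""]

print("%d children: %d elements, %d whitespace-only text nodes"
      % (child_count, len(element_children), len(whitespace_children)))
```

The line breaks and indentation around each topping become text nodes of their own, which is why walking childNodes directly tends to turn up more nodes than you expected, and why getElementsByTagName is usually the more convenient way in.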

Let’s say we want to iterate through all of the pizzas on the pizza-menu, printing the type of pizza. That code would look like this:

from xml.dom import minidom
import urllib
import sys
try:
  usock = urllib.urlopen("http://menu.pizzaplace.us")
  xmldoc = minidom.parse(usock)
  usock.close()
  # Every pizza element in the document, in document order
  pizza_list = xmldoc.getElementsByTagName('pizza')
  for pizza_element in pizza_list:
    pizza_type = pizza_element.getAttribute('type')
    print 'Pizza Type: %s' % pizza_type
except:
  print "Something really bad happened! ", sys.exc_info()[0]

Next, let’s pretend that “heart-attack-special” pizza sounds really appetizing, and we want to estimate just how much our cholesterol count will spike if we have a slice. We probably want to iterate over the toppings on that pizza to perform that evaluation. To that end, we will hunt for the pizza with the type “heart-attack-special”, grab that node, then iterate over the topping sub-nodes. Here’s how we would do that:

from xml.dom import minidom
import urllib
import sys
try:
  usock = urllib.urlopen("http://menu.pizzaplace.us")
  xmldoc = minidom.parse(usock)
  usock.close()
  pizza_list = xmldoc.getElementsByTagName("pizza")
  for pizza_element in pizza_list:
    pizza_type = pizza_element.getAttribute("type")
    print 'Pizza Type: %s' % pizza_type
    if pizza_type == 'heart-attack-special':
      # Search only within this pizza, not the whole document
      topping_list = pizza_element.getElementsByTagName('topping')
      for topping_element in topping_list:
        # (do something here)
        pass
except:
  print "Something really bad happened! ", sys.exc_info()[0]

As you can see, the pizza_element is a node like any other node, so you can call getElementsByTagName on it to get the topping elements nested inside this pizza element. The topping names (pepperoni, sausage, hamburg, canadian bacon, and ham) are themselves child nodes of their respective topping elements. Each node has a nodeType property which describes the nature of that node. The nodeTypes you will run into most often are TEXT_NODE, ELEMENT_NODE, ATTRIBUTE_NODE, and DOCUMENT_NODE, though the DOM defines others (COMMENT_NODE, for example). Thus, the word "pepperoni" is a child node of the first topping node, and is of type TEXT_NODE.
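To make the node types concrete, here is a small sketch that checks the nodeType of one node of each kind. The one-pizza document is hypothetical; the constants come from the standard xml.dom.Node class.

```python
from xml.dom import minidom
from xml.dom import Node

# Hypothetical one-pizza document, written without indentation so the
# topping's only child is its text
xmldoc = minidom.parseString(
    '<pizza type="plain"><topping>cheese</topping></pizza>')
pizza = xmldoc.documentElement
topping = pizza.getElementsByTagName("topping")[0]

# Compare each node's nodeType against the matching Node constant
kinds = {
    "document": xmldoc.nodeType == Node.DOCUMENT_NODE,
    "element": pizza.nodeType == Node.ELEMENT_NODE,
    "attribute": pizza.getAttributeNode("type").nodeType == Node.ATTRIBUTE_NODE,
    "text": topping.firstChild.nodeType == Node.TEXT_NODE,
}

for name in sorted(kinds):
    print("%s node check: %s" % (name, kinds[name]))
```

The same constants are also available on every node instance, which is why the getText helper further down can write node.TEXT_NODE without importing anything extra.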

You might be surprised to learn that an element's text is not guaranteed to arrive as a single child text node. The DOM allows a run of text to be split across several adjacent nodes, so the fourth topping node on the heart-attack-special could hold "canadian bacon" in more than one piece. With current versions of minidom a plain string like this normally comes back as one text node, but something as small as a comment inside the element will split it in two. This is not usually how we want to access the data in our XML documents; we'd prefer that "canadian bacon" be treated as a single string.
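One case where an element's text is certain to be spread over several nodes is when a comment interrupts it. The sketch below uses a hypothetical topping element with a comment in the middle and lists what minidom stores as its children.

```python
from xml.dom import minidom

# Hypothetical topping whose text is interrupted by a comment
xmldoc = minidom.parseString("<topping>canadian<!-- eh? --> bacon</topping>")
topping = xmldoc.documentElement

# One entry per child: a text node, a comment node, another text node
node_types = [node.nodeType for node in topping.childNodes]

# Keep only the text pieces, in document order
text_parts = [node.data for node in topping.childNodes
              if node.nodeType == node.TEXT_NODE]

print("child node count: %d" % len(topping.childNodes))
print("joined text: %s" % "".join(text_parts))
```

Reading only firstChild.data here would lose " bacon", which is the kind of surprise that joining every text child avoids.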

To make the data behave the way we expect it to, we can introduce our own simple utility function called getText. This function concatenates the data of every node in the supplied node list that is of type TEXT_NODE. It looks like this:

def getText(nodelist):
  # Join the data of every text node in the list, skipping other node types
  rc = ""
  for node in nodelist:
    if node.nodeType == node.TEXT_NODE:
      rc = rc + node.data
  return rc

To use it, we'd pass it the childNodes of the element whose text we're interested in. Going back to our original example, we can use the getText function to print out each topping on our heart-attack-special pizza:

from xml.dom import minidom
import urllib
import sys
# (getText, as defined above, needs to be in this script too)
try:
  usock = urllib.urlopen("http://menu.pizzaplace.us")
  xmldoc = minidom.parse(usock)
  usock.close()
  pizza_list = xmldoc.getElementsByTagName("pizza")
  for pizza_element in pizza_list:
    pizza_type = pizza_element.getAttribute("type")
    print 'Pizza Type: %s' % pizza_type
    if pizza_type == 'heart-attack-special':
      topping_list = pizza_element.getElementsByTagName('topping')
      for topping_element in topping_list:
        # getText expects a list of nodes, so hand it the element's children
        topping_text = getText(topping_element.childNodes)
        print " Topping %s" % topping_text
except:
  print "Something really bad happened! ", sys.exc_info()[0]
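If you'd like to try the whole thing without a pizza server handy, the sketch below pulls the pieces together and parses an inline string instead of a URL. The menu document, its pizza-menu root element, and the toppings_for helper are hypothetical, written only for illustration; getText is the function from above. It sticks to syntax that should run unchanged under Python 2 or 3.

```python
from xml.dom import minidom

# Hypothetical menu; the root element name is an assumption
MENU_XML = """<pizza-menu>
  <pizza type="veggie">
    <topping>mushroom</topping>
  </pizza>
  <pizza type="heart-attack-special">
    <topping>pepperoni</topping>
    <topping>sausage</topping>
    <topping>hamburg</topping>
    <topping>canadian bacon</topping>
    <topping>ham</topping>
  </pizza>
</pizza-menu>"""


def getText(nodelist):
    # Join the data of every text node in the list
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc


def toppings_for(xmldoc, wanted_type):
    # Hypothetical helper: topping names for every pizza of the given type
    names = []
    for pizza_element in xmldoc.getElementsByTagName("pizza"):
        if pizza_element.getAttribute("type") == wanted_type:
            for topping_element in pizza_element.getElementsByTagName("topping"):
                names.append(getText(topping_element.childNodes))
    return names


xmldoc = minidom.parseString(MENU_XML)
toppings = toppings_for(xmldoc, "heart-attack-special")
for topping_text in toppings:
    print(" Topping %s" % topping_text)
```

Returning a list rather than printing inside the loop is just a design choice for this sketch; it makes the result easy to reuse or test.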

The XML-parsing portions of the weather applet that I wrote for the Avant Window Navigator aren’t much more complicated than this. You can download the source code for the weather applet here. The parts which parse weather.com’s data are in the weather.py script, in the get_conditions and get_forecast functions.