Monday, July 18, 2011

Python and XML Schemas

My current project relies on a large number of XML Schema definition files. There are 1,600 types defined in various schemas, with actions for each type to be implemented as part of the project. A previous article examined CodeSynthesis XSD for C++ code generation from an XML Schema. This time we'll examine two packages for Python, GenerateDS and PyXB. Both were chosen based on their ability to feature prominently in search results.

In this article we'll work with the following schema and input data, the same used in the previous C++ discussion. It is my HR database of minions, for use when I become the Evil Overlord.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="minion">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="rank" type="xs:string"/>
      <xs:element name="serial" type="xs:positiveInteger"/>
    </xs:sequence>
    <xs:attribute name="loyalty" type="xs:float" use="required"/>
  </xs:complexType>
</xs:element>

</xs:schema>


<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<minion xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:noNamespaceSchemaLocation="schema.xsd" loyalty="0.2">
  <name>Agent Smith</name>
  <rank>Member of Minion Staff</rank>
  <serial>2</serial>
</minion>

The Python ElementTree can handle XML documents, so why generate code at all? One reason is simple readability.

Generated code    ElementTree
m.name            m.find("name").text

A more subtle reason is to catch errors earlier. Because working with the underlying XML relies on passing in the node name as a string, a typo or a misunderstanding of the XML schema results in not finding the desired element, in an exception, or both. This is what unit tests are supposed to catch, but when the same developer writes both the code and the unit test, the test is unlikely to catch a misinterpretation of the schema. With generated code, we can use static analysis tools like pylint to catch errors.
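To make the failure mode concrete, here is a minimal sketch using the standard library's ElementTree. The minion document is inlined (without the XML declaration and xsi attributes) so the example stands alone, and the misspelled tag "nmae" is purely illustrative.

```python
import xml.etree.ElementTree as ET

# The minion document from above, inlined so the sketch is self-contained.
MINION_XML = """<minion loyalty="0.2">
  <name>Agent Smith</name>
  <rank>Member of Minion Staff</rank>
  <serial>2</serial>
</minion>"""

root = ET.fromstring(MINION_XML)

# Correct element name: find() returns an Element.
name = root.find("name").text

# Misspelled element name: find() returns None, so the mistake only
# surfaces at runtime, as an AttributeError on .text.
try:
    root.find("nmae").text
    typo_caught = False
except AttributeError:
    typo_caught = True

print("%s %s" % (name, typo_caught))
```

Nothing about the string "nmae" looks wrong to a static checker, which is the gap generated classes are meant to close.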


 

GenerateDS

The generateDS Python script processes the XML schema:

python generateDS.py -o minion.py -s minionsubs.py minion.xsd

The generated code is in minion.py, while minionsubs.py contains an empty class definition for a subclass of minion. The generated class uses ElementTree for XML support, which is in the standard library in recent versions of Python. The minion class has properties for each node and attribute defined in the XSD. In our example this includes name, rank, serial, and loyalty.

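# Assumes the generated minion.py was saved as minion_generateds.py.
import minion_generateds as minion

if __name__ == '__main__':
  # parse() returns the root object built from the document.
  m = minion.parse("minion.xml")
  print('%s: %s, #%d (%f)' % (m.name, m.rank, m.serial, m.loyalty))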

 

PyXB

The pyxbgen utility processes the XML schema:

pyxbgen -u minion.xsd -m minion

The generated code is in minion.py. The PyXB file is only 106 lines long, compared with 548 lines for GenerateDS. This doesn't tell the whole story, as the PyXB generated code imports the pyxb module, whereas the generateDS code depends only on system modules. The pyxb package therefore has to be deployed to production along with the generated code.

Very much like generateDS, the PyXB class has properties for each node and attribute defined in the XSD.

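# Assumes the generated minion.py was saved as minion_pyxb.py.
import minion_pyxb as minion

if __name__ == '__main__':
  # Read the whole document, then let the binding build the object.
  with open('minion.xml') as f:
    xml = f.read()
  m = minion.CreateFromDocument(xml)
  print('%s: %s, #%d (%f)' % (m.name, m.rank, m.serial, m.loyalty))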

 

Pylint results

A primary reason for this exercise is to catch XML-related errors at build time, rather than exceptions in production. I don't believe unit tests are an effective way to verify that a developer has understood the XML schema.

To test this, a bogus 'm.fooberry' property reference was added to both test programs. pylint correctly flagged the reference as an error in the generateDS code.

E: 15: Instance of 'minion' has no 'fooberry' member (but some types could not be inferred)

pylint did not flag the error in the PyXB test code. I believe this is because PyXB doesn't name the generated class minion. Instead it is named CTD_ANON, with a runtime binding within its framework to "minion." pylint is doing a purely static analysis, and this kind of arrangement is beyond its ken.

class CTD_ANON (pyxb.binding.basis.complexTypeDefinition):
  ...

minion = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace,
           u'minion'), CTD_ANON)
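
The general shape of that arrangement can be sketched in plain Python. This is not PyXB's code: AnonymousBinding, BINDINGS and create are hypothetical names invented for illustration, and whether a particular pylint version can see through a lookup like this is not something the sketch demonstrates. It only shows that the class behind "minion" is decided by a runtime lookup rather than by a class statement named minion.

```python
class AnonymousBinding(object):
    """Hypothetical stand-in for a generated class with an opaque name."""

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank


# Hypothetical registry: the public name "minion" is attached to the class
# only through this runtime lookup, not through the class statement itself.
BINDINGS = {"minion": AnonymousBinding}


def create(tag, **fields):
    """Instantiate whichever class is registered for the given tag."""
    return BINDINGS[tag](**fields)


m = create("minion", name="Agent Smith", rank="Member of Minion Staff")

# The bogus attribute still fails, but only once the program runs.
try:
    m.fooberry
    failed_at_runtime = False
except AttributeError:
    failed_at_runtime = True

print("%s %s" % (type(m).__name__, failed_at_runtime))
```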

 

Conclusion

As a primary goal of this effort is error detection via static analysis, we'll go with generateDS.