Testing Equivalence of xml.etree.ElementTree

Question :

I want to test whether two XML elements are equivalent. Comparing the tostring output of the elements works, but that seems hacky.

Is there a better way to test equivalence of two etree Elements?

Comparing Elements directly:

import xml.etree.ElementTree as etree
h1 = etree.Element('hat',{'color':'red'})
h2 = etree.Element('hat',{'color':'red'})

h1 == h2  # False

Comparing Elements as strings:

etree.tostring(h1) == etree.tostring(h2)  # True

Answer #1:

This compare function works for me:

def elements_equal(e1, e2):
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))
Answered By: Itamar

Answer #2:

Comparing strings doesn’t always work. The order of the attributes should not matter for considering two nodes equivalent. However, if you do string comparison, the order obviously matters.

I’m not sure if it is a problem or a feature, but my version of lxml.etree preserves the order of the attributes if they are parsed from a file or a string:

>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False

This might be version-dependent (I use Python 2.7.3 with lxml.etree 2.3.2 on Ubuntu); I remember that I couldn’t find a way of controlling the order of the attributes a year ago or so, when I wanted to (for readability reasons).

As I need to compare XML files that were produced by different serializers, I see no other way than recursively comparing tag, text, attributes, and children of every node. And of course tail, if there’s anything interesting there.

Comparison of lxml and xml.etree.ElementTree

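The truth is that it may be implementation dependent. Apparently, lxml uses an ordered dict or something like that, whereas the standard xml.etree.ElementTree in Python 2.7 does not preserve the order of attributes and sorts them when serializing (since Python 3.8, tostring keeps the order in which the attributes were specified):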

Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False
>>> etree.tostring(h1)
'<hat color="blue" price="39.90"/>'
>>> etree.tostring(h2)
'<hat price="39.90" color="blue"/>'
>>> etree.dump(h1)
<hat color="blue" price="39.90"/>>>> etree.dump(h2)
<hat price="39.90" color="blue"/>>>>

(Yes, the newlines are missing. But it is a minor problem.)

>>> import xml.etree.ElementTree as ET
>>> h1 = ET.XML('<hat color="blue" price="39.90"/>')
>>> h1
<Element 'hat' at 0x2858978>
>>> h2 = ET.XML('<hat price="39.90" color="blue"/>')
>>> ET.dump(h1)
<hat color="blue" price="39.90" />
>>> ET.dump(h2)
<hat color="blue" price="39.90" />
>>> ET.tostring(h1) == ET.tostring(h2)
True
>>> ET.dump(h1) == ET.dump(h2)
<hat color="blue" price="39.90" />
<hat color="blue" price="39.90" />
True

Another question is what should be considered unimportant when comparing. For example, some fragments may contain extra spaces that we do not care about. For that reason, it is always better to write a serializing function that works exactly as we need.
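
One way to ignore extra spaces is to normalize text and tail before comparing them. The following is only a sketch, not part of the original answer: the names `_norm` and `elements_equal_ignoring_whitespace` are made up for illustration, and it assumes that whitespace-only text and missing text should count as the same thing.

```python
import xml.etree.ElementTree as ET

def _norm(s):
    # Assumption: None, '' and whitespace-only strings are all equivalent,
    # and runs of whitespace inside the text collapse to a single space.
    return ' '.join((s or '').split())

def elements_equal_ignoring_whitespace(e1, e2):
    if e1.tag != e2.tag:
        return False
    if _norm(e1.text) != _norm(e2.text):
        return False
    if _norm(e1.tail) != _norm(e2.tail):
        return False
    if e1.attrib != e2.attrib:  # dict comparison ignores attribute order
        return False
    if len(e1) != len(e2):
        return False
    return all(elements_equal_ignoring_whitespace(c1, c2)
               for c1, c2 in zip(e1, e2))

a = ET.XML('<hat color="blue" price="39.90"><size>L</size></hat>')
b = ET.XML('<hat price="39.90" color="blue">\n  <size> L </size>\n</hat>')
print(elements_equal_ignoring_whitespace(a, b))  # True
```

Whether collapsing whitespace is acceptable depends on the data; for mixed content where spacing is significant, a stricter normalization would be needed.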

Answered By: lenz

Answer #3:

Serializing and deserializing won’t work for XML because attributes are not order dependent (among other reasons). For example, these two elements are logically the same, but they are different strings:

<THING a="foo" b="bar"></THING>
<THING b="bar" a="foo"  />

Exactly how to do an element comparison is tricky. As far as I can tell, there is nothing built into ElementTree to do this for you. I needed to do this myself, and used the code below. It works for my needs, but it’s not suitable for large XML structures and is not fast or efficient! This is an ordering function rather than an equality function, so a result of 0 means equal and anything else means not equal. Wrapping it in a function that returns True or False is left as an exercise for the reader!

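def cmp_el(a, b):
    # In Python 3, None and dict views cannot be ordered with < or >,
    # so a missing tail is treated as '' and attributes are sorted first.
    if a.tag < b.tag:
        return -1
    elif a.tag > b.tag:
        return 1
    elif (a.tail or '') < (b.tail or ''):
        return -1
    elif (a.tail or '') > (b.tail or ''):
        return 1

    #compare attributes
    aitems = sorted(a.attrib.items())
    bitems = sorted(b.attrib.items())
    if aitems < bitems:
        return -1
    elif aitems > bitems:
        return 1

    #compare child nodes
    achildren = list(a)
    bchildren = list(b)

    for achild, bchild in zip(achildren, bchildren):
        cmpval = cmp_el(achild, bchild)
        if cmpval < 0:
            return -1
        elif cmpval > 0:
            return 1

    #zip() stops at the shorter list, so compare the child counts too
    if len(achildren) != len(bchildren):
        return -1 if len(achildren) < len(bchildren) else 1

    #must be equal
    return 0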
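
The True/False wrapper mentioned above could look roughly like this. It is a sketch only: `equal_by` is a hypothetical helper name, and `cmp_tags` is a stand-in comparison (tags only) included so that the block runs on its own; in practice `cmp_el` would be passed instead.

```python
import xml.etree.ElementTree as ET

def equal_by(cmp_func, a, b):
    """Return True when the three-way comparison cmp_func reports 0."""
    return cmp_func(a, b) == 0

# Hypothetical stand-in for cmp_el: orders elements by tag only.
def cmp_tags(a, b):
    return (a.tag > b.tag) - (a.tag < b.tag)

h1 = ET.Element('hat')
h2 = ET.Element('hat')
h3 = ET.Element('scarf')
print(equal_by(cmp_tags, h1, h2))  # True
print(equal_by(cmp_tags, h1, h3))  # False
```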
Answered By: afaulconbridge

Answer #4:

Believe it or not, comparing the tostring output is actually the best way to handle comparing two nodes if you don’t know how many children each may have and you want to include all children in the comparison.

Of course, if you simply have a childless node like the one you are demonstrating, you can simply compare the tag, attrib, and tail properties:

if h1.tag == h2.tag and h1.attrib == h2.attrib and h1.tail == h2.tail:
    print("h1 and h2 are the same")
else:
    print("h1 and h2 are different")

I don’t see any major benefit of this over using tostring, however.

Answered By: cwallenpoole

Answer #5:

A common way to compare complex structures is to dump them into a common, unique textual representation and compare the resulting strings for equality.

To compare two received JSON strings, you would convert them to JSON objects, then convert them back to strings (with the same converter) and compare. I did this to check JSON feeds, and it works well.
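
A minimal sketch of that JSON round trip with the standard json module. The helper name `canonical_json` is made up for illustration, and `sort_keys=True` is an added assumption so that key order does not affect the result.

```python
import json

def canonical_json(text):
    # Parse, then re-serialize with sorted keys and fixed separators so that
    # key order and insignificant whitespace no longer matter.
    return json.dumps(json.loads(text), sort_keys=True, separators=(',', ':'))

a = '{"color": "blue", "price": 39.9}'
b = '{ "price": 39.9,  "color": "blue" }'
print(canonical_json(a) == canonical_json(b))  # True
```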

For XML, it is almost the same, but you may have to handle (strip? remove?) the “.text” parts (the text, blank or not, that may be found outside tags).

So in short, your solution is not a hack, as long as you make sure two equivalent XMLs (according to your context) will have the same string representation.
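
For XML, one way to get such a representation is the canonicalize function that xml.etree.ElementTree has offered since Python 3.8. It writes C14N 2.0 output, in which attributes are sorted, and its strip_text option strips whitespace around text content. The sketch below assumes that version; `canonical_string` is a hypothetical helper name, not something from the original answer.

```python
import xml.etree.ElementTree as ET

def canonical_string(element):
    # Serialize to text first, then let canonicalize() produce C14N 2.0 output.
    # strip_text=True is an assumption that surrounding whitespace is irrelevant.
    xml_text = ET.tostring(element, encoding='unicode')
    return ET.canonicalize(xml_text, strip_text=True)

h1 = ET.XML('<hat color="blue" price="39.90"/>')
h2 = ET.XML('<hat price="39.90" color="blue"></hat>')
print(canonical_string(h1) == canonical_string(h2))
```

Whether the canonical form matches what you consider equivalent still depends on your context, for example on how comments and namespace prefixes should be treated.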

Answered By: gb.

Answer #6:

Don’t gold-plate. The comparison you have is a good one. In the end, XML is TEXT.

Answered By: fabrizioM
