Question:
I’m interested in testing the equivalence of two XML elements. I’ve found that comparing the tostring output of the elements works, but that seems hacky.
Is there a better way to test equivalence of two etree Elements?
Comparing Elements directly:
import xml.etree.ElementTree as etree
h1 = etree.Element('hat',{'color':'red'})
h2 = etree.Element('hat',{'color':'red'})
h1 == h2 # False
Comparing Elements as strings:
etree.tostring(h1) == etree.tostring(h2) # True
Answer #1:
This compare function works for me:
def elements_equal(e1, e2):
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))
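A quick check of the function above, as a sketch only: the function is repeated so the snippet runs on its own, and the sample elements (hat, band) are made up for illustration. Because attrib behaves like a dict, the order of the attributes should not affect the result.

```python
import xml.etree.ElementTree as etree

def elements_equal(e1, e2):
    # same function as above, repeated so this snippet is self-contained
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))

# made-up sample data: a and b differ only in attribute order
a = etree.XML('<hat color="red" size="7"><band width="2"/></hat>')
b = etree.XML('<hat size="7" color="red"><band width="2"/></hat>')
c = etree.XML('<hat color="red" size="7"><band width="3"/></hat>')

print(elements_equal(a, b))  # True: attribute order is ignored
print(elements_equal(a, c))  # False: the child's attribute value differs
```

Note that text and tail are compared exactly, so two documents that differ only in indentation would not be considered equal by this function.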
Answer #2:
Comparing strings doesn’t always work. The order of the attributes should not matter for considering two nodes equivalent. However, if you do string comparison, the order obviously matters.
I’m not sure if it is a problem or a feature, but my version of lxml.etree preserves the order of the attributes if they are parsed from a file or a string:
>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False
This might be version-dependent (I use Python 2.7.3 with lxml.etree 2.3.2 on Ubuntu); I remember that I couldn’t find a way of controlling the order of the attributes a year ago or so, when I wanted to (for readability reasons).
As I need to compare XML files that were produced by different serializers, I see no other way than recursively comparing tag, text, attributes, and children of every node. And of course tail, if there’s anything interesting there.
Comparison of lxml and xml.etree.ElementTree
The truth is that it may be implementation-dependent. Apparently, lxml uses an ordered dict or something like it, while the standard xml.etree.ElementTree does not preserve the order of attributes (at least in the Python 2.7 session below):
Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False
>>> etree.tostring(h1)
'<hat color="blue" price="39.90"/>'
>>> etree.tostring(h2)
'<hat price="39.90" color="blue"/>'
>>> etree.dump(h1)
<hat color="blue" price="39.90"/>>>> etree.dump(h2)
<hat price="39.90" color="blue"/>>>>
(Yes, the newlines are missing. But it is a minor problem.)
>>> import xml.etree.ElementTree as ET
>>> h1 = ET.XML('<hat color="blue" price="39.90"/>')
>>> h1
<Element 'hat' at 0x2858978>
>>> h2 = ET.XML('<hat price="39.90" color="blue"/>')
>>> ET.dump(h1)
<hat color="blue" price="39.90" />
>>> ET.dump(h2)
<hat color="blue" price="39.90" />
>>> ET.tostring(h1) == ET.tostring(h2)
True
>>> ET.dump(h1) == ET.dump(h2)
<hat color="blue" price="39.90" />
<hat color="blue" price="39.90" />
True
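The attribute order only matters for the serialized string. The attrib mapping itself compares like a dict, and dict equality ignores key order. A small sketch with the standard library follows. Whether the two tostring results match depends on the Python version (Python 3.8 and later keep the attributes in document order when serializing), so no output is shown for that line.

```python
import xml.etree.ElementTree as ET

h1 = ET.XML('<hat color="blue" price="39.90"/>')
h2 = ET.XML('<hat price="39.90" color="blue"/>')

# dict comparison does not depend on key order
print(h1.attrib == h2.attrib)  # True

# version-dependent: the result depends on how the attributes are serialized
print(ET.tostring(h1) == ET.tostring(h2))
```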
Another question is what should be considered unimportant when comparing. For example, some fragments may contain extra whitespace that we do not care about. For that reason, it is always better to write a serializing (or comparing) function that works exactly as we need.
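As an illustration of ignoring unimportant whitespace, here is a minimal sketch of a recursive comparison that strips leading and trailing whitespace from text and tail before comparing. The names _norm and elements_equal_ws are hypothetical, and the choice to strip (rather than, say, collapse internal whitespace) is an assumption about what "unimportant" means.

```python
import xml.etree.ElementTree as ET

def _norm(s):
    # hypothetical helper: treat None, '' and whitespace-only text alike
    return (s or '').strip()

def elements_equal_ws(e1, e2):
    # hypothetical name: compares tag, attributes, stripped text/tail,
    # and then the children in document order
    if e1.tag != e2.tag: return False
    if e1.attrib != e2.attrib: return False
    if _norm(e1.text) != _norm(e2.text): return False
    if _norm(e1.tail) != _norm(e2.tail): return False
    if len(e1) != len(e2): return False
    return all(elements_equal_ws(c1, c2) for c1, c2 in zip(e1, e2))

# made-up sample data: the same content, pretty-printed and compact
pretty = ET.XML('<hat>\n  <band> wide </band>\n</hat>')
compact = ET.XML('<hat><band>wide</band></hat>')
print(elements_equal_ws(pretty, compact))  # True
```

Whitespace inside the text (for example "wide  brim" versus "wide brim") is still significant in this sketch.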
Answer #3:
Serializing and comparing the strings won’t work for XML, because attributes are not order-dependent (and for other reasons). For example, these two elements are logically the same, but they are different strings:
<THING a="foo" b="bar"></THING>
<THING b="bar" a="foo" />
Exactly how to do an element comparison is tricky. As far as I can tell, there is nothing built into ElementTree to do this for you. I needed to do this myself, and used the code below. It works for my needs, but it’s not suitable for large XML structures and is not fast or efficient! This is an ordering function rather than an equality function, so a result of 0 means equal and anything else means not equal. Wrapping it with a function that returns True or False is left as an exercise for the reader!
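from functools import cmp_to_key

def cmp_el(a, b):
    # note: element text (.text) is not compared here
    # compare tags, then tails (a missing tail counts as an empty string)
    if a.tag < b.tag:
        return -1
    elif a.tag > b.tag:
        return 1
    elif (a.tail or '') < (b.tail or ''):
        return -1
    elif (a.tail or '') > (b.tail or ''):
        return 1
    # compare attributes, ignoring their order
    aitems = sorted(a.attrib.items())
    bitems = sorted(b.attrib.items())
    if aitems < bitems:
        return -1
    elif aitems > bitems:
        return 1
    # compare child nodes, ignoring their order
    achildren = sorted(a, key=cmp_to_key(cmp_el))
    bchildren = sorted(b, key=cmp_to_key(cmp_el))
    for achild, bchild in zip(achildren, bchildren):
        cmpval = cmp_el(achild, bchild)
        if cmpval < 0:
            return -1
        elif cmpval > 0:
            return 1
    # zip() stops at the shorter list, so check the child counts as well
    if len(achildren) != len(bchildren):
        return -1 if len(achildren) < len(bchildren) else 1
    # must be equal
    return 0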
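A different way to get an order-insensitive True/False answer, sketched here as an alternative to cmp_el rather than a wrapper around it: build a nested, sortable tuple for each element and compare the tuples. The names element_key and unordered_equal are hypothetical. Unlike cmp_el, this sketch also compares the element text, and it treats a missing text or tail as an empty string; both choices are assumptions.

```python
import xml.etree.ElementTree as ET

def element_key(el):
    # hypothetical helper: a nested tuple that describes the element,
    # with attributes and children sorted so their order does not matter
    return (
        el.tag,
        el.text or '',
        el.tail or '',
        tuple(sorted(el.attrib.items())),
        tuple(sorted(element_key(child) for child in el)),
    )

def unordered_equal(a, b):
    # hypothetical name: True if both elements produce the same key
    return element_key(a) == element_key(b)

# made-up sample data: same children and attributes in a different order
x = ET.XML('<r><a k="1" j="2"/><b/></r>')
y = ET.XML('<r><b/><a j="2" k="1"/></r>')
z = ET.XML('<r><b/><a j="2" k="1"/><b/></r>')
print(unordered_equal(x, y))  # True
print(unordered_equal(x, z))  # False: z has an extra child
```

Like cmp_el, this rebuilds and sorts the whole subtree on every call, so it is not meant for large documents. Ignoring child order is also only appropriate when the order of sibling elements carries no meaning in your data.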
Answer #4:
Believe it or not, that is actually the best way to compare two nodes if you don’t know how many children each may have and you want to include all children in the comparison.
Of course, if you simply have a childless node like the one you are demonstrating, you can simply compare the tag, attrib, and tail properties:
if h1.tag == h2.tag and h1.attrib == h2.attrib and h1.tail == h2.tail:
    print("h1 and h2 are the same")
else:
    print("h1 and h2 are different")
I don’t see any major benefit of this over using tostring, however.
Answer #5:
A common way to compare complex structures is to dump them in a common, unique textual representation and compare the resulting strings for equality.
To compare two received JSON strings, you would convert them to JSON objects, then convert them back to strings (with the same converter) and compare. I did this to check JSON feeds, and it works well.
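A minimal sketch of that JSON round trip with the standard json module. The helper name canonical_json is hypothetical, and passing sort_keys and fixed separators is an assumption added here so that key order and spacing cannot affect the result.

```python
import json

def canonical_json(text):
    # hypothetical helper: parse, then re-serialize with sorted keys
    # and fixed separators so that equivalent inputs give the same string
    return json.dumps(json.loads(text), sort_keys=True, separators=(',', ':'))

# made-up sample data: same content, different key order and spacing
a = '{"color": "red", "size": 7}'
b = '{ "size": 7,   "color": "red" }'
print(canonical_json(a) == canonical_json(b))  # True
```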
For XML, it is almost the same, but you may have to handle (strip? remove?) the “.text” and “.tail” parts (the text, blank or not, that may be found between tags).
So in short, your solution is not a hack, as long as you make sure two equivalent XMLs (according to your context) will have the same string representation.
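One way to get such a common string representation, assuming Python 3.8 or later, is the canonicalize function in xml.etree.ElementTree, which writes the C14N 2.0 canonical form (attributes sorted, empty elements written with explicit end tags). This is a sketch of one option and not necessarily what the answer had in mind; the sample strings are the ones used in Answer #3.

```python
import xml.etree.ElementTree as ET

a = '<THING a="foo" b="bar"></THING>'
b = '<THING b="bar" a="foo" />'

# canonicalize() returns the canonical form as a string
# when no output file is given
print(ET.canonicalize(a) == ET.canonicalize(b))  # True

# strip_text=True additionally strips whitespace around text content
c = '<THING a="foo" b="bar">  </THING>'
print(ET.canonicalize(c, strip_text=True) == ET.canonicalize(a, strip_text=True))
```

Whether canonical form matches your own notion of equivalence still depends on your context, for example on whether the order of child elements matters.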
Answer #6:
Do not gold-plate. The comparison you have is a good one. In the end, XML is just text.