Why doesn’t xpath work when processing an XHTML document with lxml (in python)?

Posted on

Question :

Why doesn’t xpath work when processing an XHTML document with lxml (in python)?

I am testing against the following test document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
<html xmlns="http://www.w3.org/1999/xhtml">
        <title>hi there</title>
        <img class="foo" src="bar.png"/>

If I parse the document using lxml.html, I can get the IMG with an xpath just fine:

>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]

However, if I parse the document as XML and try to get the IMG tag, I get an empty result:

>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")

I can navigate to the element directly:

>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>

But of course that doesn’t help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:

>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]

But that xpath is, again, obviously not useful for parsing arbitrary documents.

Obviously I am missing some key issue here, but I don’t know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don’t know what else I might need to consider in regards to namespaces.

So, what am I missing?

Asked By: John


Answer #1:

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.

Try this:

>>> tree.getroot().xpath(
...     "//xhtml:img", 
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
Answered By: Ned Batchelder

Answer #2:

XPath considers all unprefixed names to be in “no namespace”.

In particular the spec says:

“A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). “

See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that’s being used) and to use it to prefix any unprefixed name in the XPath expression.

Hope this helped.


Dimitre Novatchev

Answered By: Dimitre Novatchev

Answer #3:

If you are going to use tags from a single namespace only, as I see it the case above, you are much better off using lxml.objectify.

In your case it would be like

from lxml import objectify
root = objectify.parse(url) #also available: fromstring

You can access the nodes as

body = root.html.body
for img in body.img: #Assuming all images are within the body tag

While it might not be of great help in html, it can be highly useful in well structured xml.

For more info, check out http://lxml.de/objectify.html

Answered By: Sharmila

Leave a Reply

Your email address will not be published. Required fields are marked *