Parsing HTML in Python [closed]

Question:

What’s my best bet for parsing HTML if I can’t use BeautifulSoup or lxml? I’ve got some code that uses sgmllib, but it’s a bit low-level and the module is now deprecated.

I would prefer something that can stomach a bit of malformed HTML, although I’m pretty sure most of the input will be fairly clean.

Answer #1:

Python has a native HTML parser; however, the Tidy wrapper Nick suggested would probably be a solid choice as well. Tidy is a very common library (written in C, I think).
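For reference, here is a minimal sketch of what the native parser looks like in use. It assumes Python 3, where the module lives at html.parser (in Python 2 it was the HTMLParser module). The class name LinkCollector and the helper extract_links are made up for illustration and are not part of the standard library. The parser is event-based and does not build a tree, so it is still fairly low-level, but it does not raise on unclosed tags like the ones in the sample input.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example class: collects <a href> values and the <title> text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text between tags is delivered here, possibly in several chunks.
        if self._in_title:
            self.title_parts.append(data)


def extract_links(html_text):
    """Hypothetical helper: return (title, list of hrefs) for an HTML string."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).strip(), parser.links


if __name__ == "__main__":
    # The <p> and second <a> tags are deliberately left unclosed.
    sample = (
        "<html><head><title>Demo</title></head>"
        '<body><p>One<p>Two <a href="/a">A</a> <A HREF="/b">B<br></body>'
    )
    print(extract_links(sample))
```

Because the parser only reports start tags, end tags, and text as it encounters them, any nesting or tree structure has to be tracked by hand, as the _in_title flag does here.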

Answered By: Andrei Taranchenko

Answer #2:

Perhaps µTidylib will meet your needs?

Answered By: Nick Presta

Answer #3:

You can install lxml and many other Python modules easily on the Mac (OS X) using Pallet, the official MacPorts GUI.

The module name is py27-lxml. Easy as 1, 2, 3.

Answered By: Gussisaurio

Answer #4:

http://www.xmlhack.com/read.php?item=1392
http://sourceforge.net/projects/pirxx/

http://pyxml.sourceforge.net/topics/

I don’t have much experience with Python, but I have used Xerces (from the Apache Foundation) in the past and found it very useful. The learning curve isn’t bad either, though I’m not coming from a Python perspective, so I suggest you consider it. (The first two links discuss Python interfaces to Xerces, and the last one is the first Google hit for “python xml”.)

Answered By: Joe Bane

Answer #5:

htql is good at handling malformed HTML:

http://htql.net/

Answered By: seagulf

Answer #6:

html5lib is good:
http://code.google.com/p/html5lib/

Update: The link above is broken. A third-party mirror of it is available at https://github.com/html5lib/gcode-import

Answered By: rudyryk
