Help – Parsing HTML in python?

I’d like to write a Python script which takes an arbitrary HTML page and (tries to) extract and display its text.  This strikes me as something someone else would have already done, perhaps a million times.  I have tried to do this with minidom, but it (minidom) choked on the one test HTML page I gave it to parse.  I have also tried to download and install lxml – without any success.

Does anyone have any thoughts?  Are these even the right tools?  What should I be looking at?
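[For anyone landing here with the same question: the standard library’s html.parser module is far more tolerant of broken markup than minidom. A minimal sketch (modern Python 3 API; the class and sample markup below are illustrative, not from any particular library):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect a page's visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

extractor = TextExtractor()
extractor.feed("<html><head><style>p{}</style></head>"
               "<body><p>Hello <b>world</b></p></body></html>")
print(extractor.text())  # Hello world
```

Unlike minidom, html.parser does not require well-formed input, so it won’t choke on missing closing tags.]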

[Update: have rethought my requirements and will probably meet a good part of them – see comments]

[Update 2: Python’s default ASCII string type is astoundingly frustrating.  Somewhere, sometime, when you least expect it, it comes up to bite you, and you spend an hour or two trying to re-route what you’re doing around some problem with one of the string methods – Arghhhh!!!!]

[Update 3: The essence of understanding Unicode: u'\u201c' (the glyph “) is a unicode string. '\xe2\x80\x9c' is not a unicode string; it is just a sequence of bytes.  By providing a context for interpretation it can become a unicode string: '\xe2\x80\x9c'.decode('utf8') == u'\u201c'.  If you give it the wrong context for interpretation it will still become a unicode string, just the wrong one: '\xe2\x80\x9c'.decode('cp1252') == u'\xe2\u20ac\u0153', which prints as the garbage glyphs “.]
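[The same point in Python 3 syntax, where the bytes/str split is explicit and the bytes literal below is simply the UTF-8 encoding of the left double quote:

```python
# Three bytes, decoded two ways.
raw = b"\xe2\x80\x9c"  # UTF-8 encoding of U+201C, the left double quote

right = raw.decode("utf-8")   # correct context: one character back
wrong = raw.decode("cp1252")  # wrong context: each byte becomes its own character

print(right)  # “
print(wrong)  # “  (mojibake: â, €, œ)
assert right == "\u201c"
assert wrong == "\xe2\u20ac\u0153"
```

The bytes never change; only the codec you decode them with decides whether you get the quote mark or three junk characters.]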


4 Responses to “Help – Parsing HTML in python?”

  1. Peter 2 June 2009 at 12:28 pm

    BeautifulSoup. Probably not the fastest option around, but pretty much the most reliable, especially for tricksy HTML.
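    [A minimal sketch of what Peter is suggesting, assuming the third-party beautifulsoup4 package is installed; the sample markup here is made up to show the tag-soup tolerance:

    ```python
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    html = "<html><body><p>Hello <b>tricksy</b> HTML<p>no closing tags"
    soup = BeautifulSoup(html, "html.parser")
    # get_text() flattens the whole document to its text nodes
    print(soup.get_text(separator=" ", strip=True))  # Hello tricksy HTML no closing tags
    ```

    Note the input has unclosed tags and no </html>; BeautifulSoup parses it anyway.]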

  2. Nick 2 June 2009 at 12:29 pm

    I’ve had encoding problems with it, though (turning an arbitrary webpage into ASCII text). Let me know if you figure out the right way to do that.

  3. Malcolm Tredinnick 2 June 2009 at 3:42 pm

    Second the recommendation to look at html2text. It’s probably the simplest way to get what you’re after and is not too hard to use inside another program (the html2text() function is the right place to start calling if you’re doing that).

    For more advanced “convert to DOM and manipulate” stuff, lxml’s HTMLParser() is probably the nicest solution. I’m a little surprised that’s causing you problems (old-ish Linux install, so it’s not just available as a package in the system?). It would be worth persisting with if you want to go that route. The challenging bit with using lxml (or similar) is that you get back a fairly complex data structure that you have to traverse to pull out the text nodes and then throw away random whitespace stuff that you probably don’t care about.
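    [A sketch of the lxml route Malcolm describes, assuming the third-party lxml package is installed; fromstring() runs the input through lxml’s forgiving HTML parser, and text_content() does the tree-traversal for you (the sample markup is illustrative):

    ```python
    from lxml import html  # third-party: pip install lxml

    doc = html.fromstring("<p>Some <i>nested</i> text</p>")
    # text_content() flattens the element tree to the text of all its nodes
    print(doc.text_content())  # Some nested text
    ```

    For finer control you would walk the tree yourself and filter out the whitespace-only text nodes, which is the traversal work Malcolm mentions.]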

    The other library that is at the forefront of usefulness in this area is html5lib (which has a Python implementation). If you’re doing sanitisation work — stripping out dangerous elements from user-provided input, for example — it’s probably the best approach.

    For what you’ve described, however, I suspect html2text, which is only a single file, is probably the ticket.

  4. brendanscott 2 June 2009 at 5:24 pm

    Thanks for the comments everyone.

    I had seen html2text earlier. Based on the comments I have looked at it again. On reflection I suspect it will be a 90% or so fit for what I want to achieve. I am thinking that getting lxml etc up and running would be aiming for perfection, which maybe I can do later.
