Help – Parsing HTML in Python?
I’d like to write a Python script that takes an arbitrary HTML page and (tries to) extract and display its text. This strikes me as something someone else will have done already, perhaps a million times. I have tried minidom, but it choked on the one test HTML file I gave it to parse. I have also tried to download and install lxml – without any success.
Does anyone have any thoughts? Are these even the right tools? What should I be looking at?
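For what it’s worth, a bare-bones version of the text extraction can be done with nothing but the standard library’s `HTMLParser` (shown here in modern Python 3 syntax, where the module is `html.parser`) – a minimal sketch, not a robust solution like the ones being asked about:

```python
# Minimal sketch: extract visible text from HTML using only the
# standard library, skipping the contents of <script> and <style>.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(html_to_text("<p>Hello <b>world</b></p><script>x=1</script>"))
# -> Hello world
```

This tolerates somewhat malformed markup better than minidom, since `html.parser` does not require well-formed XML.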
[Update: I have rethought my requirements, and html2ascii.py will probably meet a good part of them – see comments]
[Update 2: Python’s default ASCII string type is astoundingly frustrating. Somewhere, sometime, when you least expect it, it comes up to bite you, and you spend an hour or two trying to re-route what you’re doing around some problem with one of the string methods – Arghhhh!!!!]
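The bite described above is Python 2’s implicit ASCII decode whenever a byte string and a Unicode string mix. Python 3 removed the trap by refusing to mix the two types at all; a sketch of that modern behaviour, for comparison:

```python
# In Python 2, 'bytes' + u'text' implicitly decoded the bytes as ASCII,
# raising UnicodeDecodeError the moment a non-ASCII byte appeared.
# Python 3 keeps bytes and text strictly separate instead:
try:
    result = b'\xe2\x80\x9c' + 'abc'                  # bytes + str
except TypeError:
    result = b'\xe2\x80\x9c'.decode('utf8') + 'abc'   # decode first, then join
print(result)  # -> “abc
```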
[Update 3: The essence of understanding Unicode: u'\u201c' (->“<-) is a Unicode string. '\xe2\x80\x9c' is not a Unicode string; in fact, it’s just a sequence of bytes. By providing it a context for interpretation it can become a Unicode string: '\xe2\x80\x9c'.decode('utf8') == u'\u201c'. If you give it the wrong context for interpretation it will still become a Unicode string, just the wrong one: '\xe2\x80\x9c'.decode('cp1252') == u'\xe2\u20ac\u0153', which is just garbage ->â€œ<- when you print the glyphs.]
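The update above translates directly to runnable Python 3, where byte strings carry an explicit `b` prefix that makes the bytes/text distinction visible:

```python
# The same three bytes decoded with the right and the wrong codec.
raw = b'\xe2\x80\x9c'        # UTF-8 encoding of the left curly quote
good = raw.decode('utf8')    # -> '\u201c', i.e. “
bad = raw.decode('cp1252')   # -> '\xe2\u20ac\u0153', i.e. â€œ (mojibake)
print(good, bad)
```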