From the doc string:
A very spartan attempt at a script that converts HTML to
plaintext.
The original use for this little script was when I send HTML emails out I also
wanted to send a plaintext version of the HTML email as multipart. Instead of
having two methods for generating the text I decided to focus on the HTML part
first and foremost (considering that a large majority of people don't have a
problem with HTML emails) and make the fallback (plaintext) created on the fly.
This little script takes a chunk of HTML and strips out everything except the
<body> (or the element with a given ID) and inside that chunk it makes certain
conversions, such as replacing all hyperlinks with footnotes where the URL is
shown at the bottom of the text instead. <strong>words</strong> are converted
to *words* and it makes a fair attempt at getting the linebreaks right.
As a last resort, it strips away all other tags left that couldn't be gracefully
replaced with a plaintext equivalent.
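To make the conversions described above concrete, here is a minimal sketch using only the standard library's re module. It is not the code from html2plaintext.py (which parses with BeautifulSoup); the function name html_to_plaintext_sketch and the "[n]" footnote format are assumptions made for illustration, and the regular expressions only cope with simple, well-formed markup.

```python
import re


def html_to_plaintext_sketch(html):
    """Hypothetical helper for illustration; not the original script."""
    footnotes = []

    def link_to_footnote(match):
        # Remember the URL and leave a numbered marker after the link text.
        footnotes.append(match.group(1))
        return "%s [%d]" % (match.group(2), len(footnotes))

    # <a href="URL">text</a> becomes "text [n]".
    text = re.sub(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                  link_to_footnote, html, flags=re.I | re.S)
    # <strong>words</strong> becomes *words*.
    text = re.sub(r'<strong>(.*?)</strong>', r'*\1*', text,
                  flags=re.I | re.S)
    # <br> and closing </p> tags become line breaks.
    text = re.sub(r'<br\s*/?>', '\n', text, flags=re.I)
    text = re.sub(r'</p\s*>', '\n\n', text, flags=re.I)
    # Last resort: strip whatever tags are left.
    text = re.sub(r'<[^>]+>', '', text).strip()
    # List the collected URLs at the bottom of the text.
    if footnotes:
        text += '\n\n' + '\n'.join(
            '[%d] %s' % (number, url)
            for number, url in enumerate(footnotes, 1))
    return text


sample = ('<p>Read the <a href="http://example.com/terms">terms</a> '
          'and <strong>sign</strong>.</p>')
print(html_to_plaintext_sketch(sample))
```

A real parser is the safer choice for arbitrary HTML; the regular expressions here are only meant to show the order of the steps.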
Thanks to Fredrik Lundh's unescape() function, things like:
'Terms &amp; Conditions' is converted to
'Terms & Conditions'
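For readers on a current Python, the same entity conversion can be sketched with the standard library's html.unescape(), which is a stand-in here and not Fredrik Lundh's original function. The wrapper name unescape_entities is an assumption for illustration.

```python
import html


def unescape_entities(text):
    # Hypothetical wrapper: html.unescape() turns named and numeric
    # character references back into the characters they stand for.
    return html.unescape(text)


print(unescape_entities('Terms &amp; Conditions'))
```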
It's far from perfect but a good start. It works for me for now.
Version at the time of writing this: 0.1.
I wouldn't be surprised if I've reinvented the wheel here but I did plenty of searches and couldn't really find anything like this.
I'll run this for a while until I stumble across bugs or other inconsistencies, which I haven't done yet. The one thing I'm really unhappy about is the way I extract the body from the BeautifulSoup parse object. I couldn't find a better way in the few minutes I had to spare on this.
Feel free to comment on things you think are pressing bugs.
You can download the script here: html2plaintext.py version 0.1
UPDATE
I should take a second look at Aaron Swartz's html2text.py script the next time I work on this. His script seems a lot more mature and Aaron is a brilliant Python developer.