
Problems converting HTML to ASCII and back using Python (html2text, textile)


I am trying to convert HTML text to ASCII, translate it, and then convert it back to HTML.

While testing the basic structure of the script, I ran into a problem: textile does not convert everything back into readable HTML.

I think this is caused by the indented output, which makes it hard for textile to parse, but this is where I got stuck.

import html2text
import textile

# Convert the HTML to plain text
h = html2text.html2text('<p><strong>This is a test:</strong></p><ul><li>This text will be converted to ascii</li><li>and then&nbsp;<strong>translated</strong></li><li>and lastly converted back to HTML</li></ul>')
print(h)

print('------------Converting Back to HTML-----------------------------')

# Convert the plain text back to HTML
html = textile.textile(h)
print(html)

This is the output I get:

**This is a test:**

  * This text will be converted to ascii
  * and then  **translated**
  * and lastly converted back to HTML


------------Converting Back to HTML-----------------------------
    <p><b>This is a test:</b></p>

  * This text will be converted to ascii
  * and then  <b>translated</b>
  * and lastly converted back to <span class="caps">HTML</span>

I should add that I will be using HTML data from an Excel sheet in the future.


Solution

  • One important thing to note is that html2text converts HTML to Markdown, not Textile, so it is something of a coincidence when the round trip produces the right results. I'd recommend looking for a converter that understands the markup language you're using. Pandoc can convert to and from just about any format.

    That said, you're correct that the indentation is causing the issue with the lists, and it can be solved by simple textual substitution on h:

    html = textile.textile(h.replace("\n  *", "\n*"))
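The substitution above depends on the bullets being indented by exactly two spaces and preceded by a newline. A slightly more tolerant variant, offered only as a sketch, uses a regular expression to strip any leading spaces or tabs before a `*` list marker. The helper name `dedent_list_markers` and the `sample` string are illustrative assumptions, not part of html2text or textile; the sample simply mirrors the output shown in the question.

```python
import re


def dedent_list_markers(text: str) -> str:
    """Strip leading spaces/tabs before '*' bullets so each marker starts at column 0.

    Hypothetical helper for illustration; not part of html2text or textile.
    """
    # (?m) makes ^ match at the start of every line. The lookahead requires a
    # space after '*', so an indented line starting with '**bold**' is untouched.
    return re.sub(r"(?m)^[ \t]+\*(?= )", "*", text)


# Sample text mirroring the html2text output from the question (assumed shape).
sample = (
    "**This is a test:**\n"
    "\n"
    "  * This text will be converted to ascii\n"
    "  * and then  **translated**\n"
    "  * and lastly converted back to HTML\n"
)

cleaned = dedent_list_markers(sample)
print(cleaned)
```

The cleaned string could then be passed to textile.textile(cleaned) in place of the replace() call. Unlike the plain string substitution, this also handles a bullet on the very first line and indentation of other widths.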