html filter text-processing text-extraction text-formatting

Stripping HTML but retaining block/inline structure


I would like to convert HTML to plain text but retain the minimum structure.

  • Everything that only the browser needs to see, such as <script> and <style> sections, should be stripped completely.
  • All block-level tags should be converted to <div> and all inline tags to <span>. Alternatively, inline tags can be removed completely without leaving whitespace behind, with anything delineated by block-level tags turned into paragraphs separated by two line breaks.

The idea is to turn arbitrary web pages into something suitable for natural language text processing, without the artefacts left by naively removing markup, which artificially breaks words apart or makes unrelated blocks look like sentences.

Any binary, library, or source in any programming language is OK.

Is there a standard source, preferably machine-readable, with a full list of elements that defines which are block-level, which are inline, and which are like <script> and <style> above?


Solution

  • Here's my own tool for this problem, written in Perl using HTML::Parser and published as a GitHub gist: html2txt.pl

    It's unfinished and perhaps slightly Windows-centric, but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.
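
For anyone who wants to see the general idea without Perl, here is a rough sketch in Python using only the standard library's html.parser. It is not a port of html2txt.pl and makes no claim to match its behaviour. The SKIP_TAGS and BLOCK_TAGS sets are hand-picked assumptions for illustration, not an authoritative list, and the names TextExtractor and html_to_text are made up for this example. Inline tags are dropped without inserting whitespace, block tags end the current paragraph, and paragraphs are joined with two line breaks.

```python
from html.parser import HTMLParser

# Assumption: illustrative, hand-picked tag sets; not a complete or standard list.
SKIP_TAGS = {"script", "style", "noscript", "template"}
BLOCK_TAGS = {
    "address", "article", "aside", "blockquote", "body", "br", "dd", "div",
    "dl", "dt", "fieldset", "figcaption", "figure", "footer", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main", "nav",
    "ol", "p", "pre", "section", "table", "td", "th", "title", "tr", "ul",
}


class TextExtractor(HTMLParser):
    """Collects text, treating block tags as paragraph boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._paragraphs = []  # finished paragraphs
        self._buf = []         # text pieces of the paragraph in progress
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element

    def _break(self):
        # Close the current paragraph, collapsing runs of whitespace.
        paragraph = " ".join("".join(self._buf).split())
        if paragraph:
            self._paragraphs.append(paragraph)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._break()
        # Any other tag is treated as inline and dropped with no whitespace.

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._break()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def text(self):
        self._break()  # flush whatever is still buffered
        return "\n\n".join(self._paragraphs)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text on a page's source returns the paragraphs as a single string. Note that html.parser is a tokenizer rather than a full HTML5 tree builder, so badly broken markup (for example an unclosed <noscript>) can throw this sketch off, and whitespace inside <pre> is collapsed like everywhere else.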