HTML downloading and text extraction


What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and support for threading would be a bonus.

The platform is Linux.


Solution

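  • wget | html2ascii (run wget with -q -O - so that the page is written to standard output and can be piped into the converter)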

    Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).

    See also: lynx, which can print a page as plain text with its -dump option.
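
The pipeline above could be wrapped in a couple of small bash functions to get control over the output file names. This is only a rough sketch under a few assumptions: the function names url_to_filename and fetch_text, the out directory, and the urls.txt list are all made up for illustration, and it assumes the converter is installed under the name html2text (swap in html2ascii if that is what your system has).

```shell
#!/usr/bin/env bash
# Sketch only: url_to_filename, fetch_text, the "out" directory, and urls.txt
# are hypothetical names chosen for this example.

# Turn a URL into a safe file name: drop the scheme, replace every character
# that is not a letter, digit, dot, or dash with an underscore, append .txt.
url_to_filename() {
    local name="${1#*://}"
    name="${name//[^A-Za-z0-9.-]/_}"
    printf '%s.txt\n' "$name"
}

# Download one URL to standard output and convert the HTML to plain text.
# Assumes an html2text command is available on the PATH.
fetch_text() {
    local url="$1" outdir="${2:-out}"
    mkdir -p "$outdir"
    wget -q -O - "$url" | html2text > "$outdir/$(url_to_filename "$url")"
}

# Example: process every URL in urls.txt (one per line), four at a time.
#   export -f url_to_filename fetch_text
#   xargs -P 4 -I{} bash -c 'fetch_text "$1"' _ {} < urls.txt
```

The commented xargs line is one possible way to get the parallelism asked for in the question: -P runs several downloads at once as separate processes rather than threads. Note that if a download fails, the sketch still leaves an empty output file behind, so some error checking would be needed for real use.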