javascript, node.js, text, text-extraction, rss-reader

Node.js module for extracting web page content?


Can somebody recommend a Node.js module or a JavaScript library (not based on Readability) that can be used to extract content from web pages and RSS feeds?

I found a good PHP library that can do the job - http://fivefilters.org/content-only/ - but I am looking for a Node.js module that would do the same.

Thank you!


Solution

  • I wrote a Node.js module just for this purpose called 'unfluff':

    https://github.com/ageitgey/node-unfluff

    Hopefully that will solve your problem.

    Unfluff is based on the popular "python-goose" and "goose" (Scala) page extraction libraries, in case you are familiar with those.
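
For anyone who wants to try it, here is a minimal sketch of how the module might be wired into your own code. It assumes, going by the project README, that the function exported by unfluff is called as extractor(html, lang) and returns an object with at least `title` and `text` fields. The helper name `extractArticle` is made up for this example and is not part of unfluff. The extractor is passed in as an argument so that the helper can be tried out with a stand-in function before the real module is installed.

```javascript
// Hypothetical helper (not part of unfluff): runs an extractor function
// over an HTML string and keeps only the title and the main body text.
// `extractor` is assumed to have the shape extractor(html, lang) -> object,
// which is how the unfluff README describes its exported function.
function extractArticle(extractor, html, lang = 'en') {
  const data = extractor(html, lang);
  return {
    // Fall back to empty strings if the extractor found nothing.
    title: data.title || '',
    text: data.text || '',
  };
}

// Usage with unfluff itself, assuming `npm install unfluff` has been run
// and `html` holds the page source you fetched:
//
//   const unfluff = require('unfluff');
//   const article = extractArticle(unfluff, html);
//   console.log(article.title);
//   console.log(article.text);
```

Fetching the page or the RSS feed is left out on purpose; any HTTP client will do, as long as you hand the resulting HTML string to the helper.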