html perl html-content-extraction

How can I extract HTML content efficiently with Perl?


I am writing a crawler in Perl that has to extract the contents of web pages residing on the same server. I am currently using the HTML::Extract module for this, but I found it a bit slow. Looking at its source code, I found that it does not use a connection cache for LWP::UserAgent.

My last resort is to grab HTML::Extract's source code and modify it to use a cache, but I really want to avoid that if I can. Does anyone know of another module that does the same job better? I basically just need to grab all the text in the <body> element with the HTML tags removed.
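
For reference, the connection-reuse idea described above could be sketched roughly as follows, without patching HTML::Extract. This is only a sketch under assumptions: it uses LWP::UserAgent with keep_alive enabled (which sets up an LWP::ConnCache) and HTML::TreeBuilder to strip tags. The helper names body_text_from_html and body_text_from_url are hypothetical, and the sample HTML is made up for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# keep_alive => N makes LWP::UserAgent set up an LWP::ConnCache, so
# repeated requests to the same server can reuse open connections.
# The capacity of 4 is an arbitrary choice for this sketch.
my $ua = LWP::UserAgent->new( keep_alive => 4 );

# Hypothetical helper: return the text inside <body> with tags removed.
sub body_text_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $body = $tree->look_down( _tag => 'body' );
    my $text = $body ? $body->as_text : '';
    $tree->delete;    # release the parse tree explicitly
    return $text;
}

# Hypothetical helper: fetch a page through the shared user agent,
# then extract its body text.
sub body_text_from_url {
    my ($url) = @_;
    my $res = $ua->get($url);
    die "GET $url failed: " . $res->status_line . "\n"
        unless $res->is_success;
    return body_text_from_html( $res->decoded_content );
}

# Offline check of the extraction step (no network access needed).
print body_text_from_html(
    '<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>'
), "\n";
```

Because every call to body_text_from_url goes through the same $ua object, the crawler would not need to open a new connection for each page on the same server, provided the server allows keep-alive connections.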


Solution

  • I use pQuery for my web scraping, but I've also heard good things about Web::Scraper.

    Both of these, along with other modules, have appeared in answers on SO to questions similar to yours: