
Perl, mod_perl2 or CGI for a web-scraping service?


I'm going to design an open-source web service which should collect ("web-scrape") some data from multiple - currently three - web sites.

The web sites do not expose any web service or API; they just publish web pages.

Data will be collected 'live' on each client request from all the web sites in parallel, then parsed and returned to the client as XML.

The server operating system will be Linux.

The clients will initially be just an Android application of mine.

There may be about 100 or more concurrent clients, if the project is successful... ;-).

Currently my preference is to adopt:

  • Perl (as the service language)
  • mod_perl2 with ModPerl::Registry (for a fast Perl interpreter embedded in Apache)
  • Perl module CHI::Driver::FastMmap (for a modern and fast cache handler)
  • Perl module Coro (for an async event loop to issue many requests in parallel)

Since I suppose the specifications of this project may be of general interest, and since I am running into many problems combining Coro with mod_perl2, I ask:

Are my adoption preferences well matched?

Do you see any incompatibilities or potential problems?

Do you have any suggestions to improve (in this order):

  • compatibility among components
  • neatness of the implementation
  • ease of maintainability
  • performance

Solution

  • You probably don't want to use mod_perl for any new project anymore. You really want something Plack-based, or maybe even Plack itself. If you want to use Coro, an AnyEvent-based backend such as Twiggy may make the most sense (though you may want to put a reverse proxy in front of it).
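
To give a rough idea of what the Plack/Twiggy route could look like, here is a minimal sketch of a PSGI application that fetches several pages in parallel with AnyEvent::HTTP and answers with XML once all of them have returned. The source URLs, the `xml_escape` and `result_to_xml` helpers, and the XML layout are placeholders invented for illustration; real scraping and parsing would replace `result_to_xml`. It uses plain AnyEvent callbacks rather than Coro, and leaves caching out.

```perl
# app.psgi: a minimal sketch under the assumptions stated above.
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;    # exports http_get

# Hypothetical source sites; replace with the real pages to scrape.
my @SOURCES = (
    'http://site-one.example/page',
    'http://site-two.example/page',
    'http://site-three.example/page',
);

# Escape a string for use in XML text or attribute values.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Placeholder for the real parser: only reports status and body size.
sub result_to_xml {
    my ($url, $status, $body) = @_;
    return sprintf qq{  <source url="%s" status="%s" bytes="%d"/>\n},
        xml_escape($url), xml_escape($status), length($body // '');
}

my $app = sub {
    my ($env) = @_;

    # PSGI delayed response: return a callback that the server invokes
    # with a responder, so the event loop is never blocked.
    return sub {
        my ($respond) = @_;
        my @parts;    # filled in completion order, not request order

        my $cv = AnyEvent->condvar;

        # The callback given to the first begin() runs when every
        # begin() has been matched by an end().
        $cv->begin(sub {
            $respond->([
                200,
                [ 'Content-Type' => 'application/xml; charset=utf-8' ],
                [ qq{<?xml version="1.0" encoding="UTF-8"?>\n<results>\n},
                  @parts, "</results>\n" ],
            ]);
        });

        for my $url (@SOURCES) {
            $cv->begin;
            http_get $url, timeout => 10, sub {
                my ($body, $headers) = @_;
                push @parts, result_to_xml($url, $headers->{Status}, $body);
                $cv->end;
            };
        }

        $cv->end;    # matches the first begin()
    };
};

$app;
```

Assuming Plack, Twiggy and AnyEvent::HTTP are installed, something like `plackup -s Twiggy app.psgi` should serve it; a reverse proxy can then sit in front of that port.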