python web-scraping google-app-engine web-crawler

Scraping Python advice needed


I need to get product IDs from an e-commerce website. The product ID is the number at the end of each product URL.

For example, http://example.com/sp/123170/ has product ID 123170.
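
As a minimal sketch of that extraction step, assuming every product URL ends in a run of digits with an optional trailing slash (as in the example URL), the ID can be pulled out with a regular expression. The function name `extract_product_id` is made up for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape, taken from the single example above: the path ends in /<digits>/
PRODUCT_ID_RE = re.compile(r"/(\d+)/?$")


def extract_product_id(url):
    """Return the trailing numeric product ID as a string, or None if there is none."""
    path = urlparse(url).path  # ignores any query string or fragment
    match = PRODUCT_ID_RE.search(path)
    return match.group(1) if match else None


print(extract_product_id("http://example.com/sp/123170/"))  # 123170
```

The ID is kept as a string so that any leading zeros survive; convert it with `int()` if the site's IDs are known to be plain integers.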

Some requirements:

  • The code must be written in Python.
  • Because the number of products is large, the program must be able to resume after it stops for any reason.
  • It should run once a day.
  • New products are added or updated every day, so the program needs to handle that. If possible, I would like to run it on Google App Engine.
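
One possible way to cover the restart and daily-update requirements is to keep crawl progress in a small persistent store. The sketch below uses the standard-library `sqlite3` module; the table layout and the function names (`open_store`, `add_products`, `pending`, `mark_done`) are assumptions made for illustration, not part of any framework.

```python
import sqlite3


def open_store(path="products.db"):
    """Open (or create) the SQLite file that records crawl progress."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        " product_id TEXT PRIMARY KEY,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def add_products(conn, product_ids):
    """Queue IDs for processing; IDs already in the table are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO products (product_id) VALUES (?)",
        [(pid,) for pid in product_ids],
    )
    conn.commit()


def pending(conn):
    """Return the IDs that have not been processed yet."""
    rows = conn.execute(
        "SELECT product_id FROM products WHERE done = 0 ORDER BY product_id"
    )
    return [row[0] for row in rows]


def mark_done(conn, product_id):
    """Commit after each item so a crash loses at most the item in progress."""
    conn.execute("UPDATE products SET done = 1 WHERE product_id = ?", (product_id,))
    conn.commit()
```

A daily run would call `add_products` with whatever IDs it discovers, then loop over `pending(conn)` and call `mark_done` after each one. If the process dies, the next run picks up the remaining IDs. A local SQLite file is only a stand-in here; a hosted environment such as App Engine would likely need a managed datastore for the same bookkeeping.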

Please recommend some approaches and open-source code for this job. I found Scrapy (scrapy.org) and Beautiful Soup. I would also appreciate advice on them: which one is better for this purpose?


Solution

  • For periodic scheduling, look at cron jobs in App Engine.

    Scrapy is a good framework for web scraping. An alternative is Beautiful Soup together with the requests library, which supports authentication; downloads can be run in parallel with threads.

    Before you scrape, though, check whether the commerce website already provides an API.
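
To illustrate the parsing half of the Beautiful Soup and requests approach, here is a rough sketch that collects product IDs from the links on one page. It deliberately uses the standard library's `html.parser` as a stand-in for Beautiful Soup so that it runs without extra dependencies, and it takes the HTML as a string instead of downloading it. The class and function names, the sample HTML, and the assumed URL shape (path ending in digits) are all made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed URL shape: the path ends in /<digits>/ as in http://example.com/sp/123170/
PRODUCT_ID_RE = re.compile(r"/(\d+)/?$")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def product_ids_in_page(html, base_url):
    """Return the set of product IDs linked from one page of HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    ids = set()
    for link in collector.links:
        match = PRODUCT_ID_RE.search(urlparse(link).path)
        if match:
            ids.add(match.group(1))
    return ids


# Hypothetical listing page with two product links and one unrelated link
sample = '<a href="/sp/123170/">A</a> <a href="/about/">About</a> <a href="/sp/42/">B</a>'
print(sorted(product_ids_in_page(sample, "http://example.com/")))
```

With Beautiful Soup installed, the `LinkCollector` class could be replaced by a loop over the parsed document's `<a>` tags, and the HTML string would come from an HTTP response body. Scrapy bundles the downloading, link following, and scheduling parts that this sketch leaves out.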