python, data-extraction, scrapely

How to extract a list of items using scrapely?


I'm using scrapely to extract data from some HTML, but I'm having difficulties extracting a list of items.

The scrapely GitHub project describes only a simple example:

from scrapely import Scraper
s = Scraper()

s.train(url, data)
s.scrape(another_url)

This is nice if, for example, you are trying to extract data as described:

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows that section is a quick example of the simplest possible usage, that you can run in a Python shell.

However, I'm not sure how to extract the data when the page contains a list like this:

Ingredientes

- 50 gr de hojas de albahaca
- 4 cucharadas (60 ml) de piñones
- 2 - 4 dientes de ajo
- 120 ml (1/2 vaso) de aceite de oliva virgen extra
- 115 gr de queso parmesano recién rallado
- 25 gr de queso pecorino recién rallado ( o queso de leche de oveja curado)

I know I can extract this using XPath or CSS selectors, but I'm more interested in using a parser that can extract the same kind of data from similar pages.


Solution

  • Scrapely can be trained to extract a list of items. The trick is to pass the first and last items of the list as a Python list when training. Here is an example inspired by the question (training: a 10-item ingredient list from url1; test: a 7-item list from url2):

    from scrapely import Scraper

    s = Scraper()

    # Train on one page, giving only the first and last items of the list.
    url1 = 'http://www.sabormediterraneo.com/recetas/postres/leche_frita.htm'
    data = {'ingreds': ['medio litro de leche',   # first and last items
      u'canela y az\xfacar para espolvorear']}
    s.train(url1, data)

    # Scrape a different page that uses the same layout.
    url2 = 'http://www.sabormediterraneo.com/recetas/cordero_horno.htm'
    print(s.scrape(url2))
    

    Here is the output:

    [{u'ingreds': [
      u' 2 piernas o dos paletillas de cordero lechal o recental ',
      u'3 dientes de ajo',
      u'una copita de vino tinto / o / blanco',
      u'una copita de agua',
      u'media copita de aceite de oliva',
      u'or\xe9gano, perejil',
      u'sal, pimienta negra y aceite de oliva']}]
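
Scraped values can carry stray whitespace, as the first item in the output above shows. A small post-processing step could tidy this up. The sketch below uses a hypothetical helper, `clean_records`, which is not part of scrapely, and assumes the result has the shape shown above: a list of dicts mapping field names to lists of strings.

```python
def clean_records(records):
    """Strip surrounding whitespace from every scraped string and drop empty ones.

    Hypothetical helper (not part of scrapely). Assumes `records` is a list of
    dicts whose values are lists of strings, like the output shown above.
    """
    cleaned = []
    for record in records:
        cleaned.append({
            field: [value.strip() for value in values if value.strip()]
            for field, values in record.items()
        })
    return cleaned


# Sample data copied from the output above (shortened to two items).
sample = [{'ingreds': [' 2 piernas o dos paletillas de cordero lechal o recental ',
                       '3 dientes de ajo']}]
print(clean_records(sample))
```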
    

    Training on the question's ingredient list (http://www.sabormediterraneo.com/cocina/salsas6.htm) did not generalize directly to the "recetas" pages. One solution would be to train several scrapers and then check which one works on a given page. (Training one scraper on several pages did not give a general solution in a quick test of mine.)
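
The idea of training several scrapers and then checking which one works on a given page could be sketched roughly as follows. This is only a sketch under stated assumptions: `pick_result` and `looks_valid` are hypothetical helper names, not part of scrapely, and the validity check (at least `min_items` values in the `ingreds` field) is just one guess at what "works" should mean. The helpers rely only on each scraper having a `scrape(url)` method that returns a list of dicts, as in the output above.

```python
def looks_valid(records, field='ingreds', min_items=2):
    """Heuristic (an assumption): a result 'works' if some record holds
    at least `min_items` values for `field`."""
    for record in records or []:
        if len(record.get(field, [])) >= min_items:
            return True
    return False


def pick_result(scrapers, url, is_valid=looks_valid):
    """Try each trained scraper in turn and return the first valid result.

    `scrapers` is any iterable of objects with a scrape(url) method, for
    example scrapely Scraper instances trained on different page layouts.
    Returns None if no scraper produces a valid result.
    """
    for scraper in scrapers:
        try:
            records = scraper.scrape(url)
        except Exception:
            # A template that does not fit the page may fail; try the next one.
            continue
        if is_valid(records):
            return records
    return None
```

With scrapely, `scrapers` might hold one `Scraper` trained on the salsas page and another trained on a recetas page (say `s_salsas` and `s_recetas`, both hypothetical names); `pick_result([s_salsas, s_recetas], url2)` would then return the output of whichever template fits the page.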