What is the safest way to extract item information from pages? Sometimes an item may be missing from the page, and you end up breaking the crawler.
Look at this example:
for cotacao in tabela_cotacoes:
    citem = CotacaoItem()
    citem['name'] = cotacao.select("td[4]/text()").extract()[0]
    citem['symbol'] = cotacao.select("td/a/b/text()").extract()[0]
    citem['current'] = cotacao.select("td[6]/text()").extract()[0]
    citem['last_neg'] = cotacao.select("td[7]/text()").extract()[0]
    citem['oscillation'] = cotacao.select("td[8]/text()").extract()[0]
    citem['openning'] = cotacao.select("td[9]/text()").extract()[0]
    citem['close'] = cotacao.select("td[10]/text()").extract()[0]
    citem['maximum'] = cotacao.select("td[11]/text()").extract()[0]
    citem['minimun'] = cotacao.select("td[12]/text()").extract()[0]
    citem['volume'] = cotacao.select("td[13]/text()").extract()[0]
If an item is missing from the page, .extract() returns [], and indexing that empty list with [0] raises an IndexError (list index out of range).
So the question is: what is the best way to deal with this?
Write a little helper function:
def extractor(xpathselector, selector):
    """
    Helper function that extracts info from an xpathselector object
    using the selector constraints.
    """
    val = xpathselector.select(selector).extract()
    return val[0] if val else None
And use it like this:
citem['name'] = extractor(cotacao, "td[4]/text()")
Return an appropriate value to indicate that a citem value wasn't found. In my code I returned None; change it if necessary (for example, return '' if it makes sense).
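If different fields need different fallbacks, one option is to make the fallback a parameter. The sketch below is only an illustration: the `default` argument, the `FIELDS` mapping, and the `FakeRow` / `FakeSelection` classes are assumptions added here (the fake classes merely imitate the `.select(...).extract()` interface so the snippet runs without Scrapy), and the row contents are made-up sample data.

```python
class FakeSelection:
    """Hypothetical stand-in for the object returned by .select()."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        # Like the real call, return a (possibly empty) list of strings.
        return list(self._values)


class FakeRow:
    """Hypothetical stand-in for a table-row selector (XPath -> strings)."""

    def __init__(self, cells):
        self._cells = cells

    def select(self, xpath):
        # Unknown XPaths behave like a cell that is missing from the page.
        return FakeSelection(self._cells.get(xpath, []))


def extractor(xpathselector, selector, default=None):
    """Return the first match for selector, or default if nothing matches."""
    val = xpathselector.select(selector).extract()
    return val[0] if val else default


# Field name -> XPath, so the loop body does not repeat the lookup code.
FIELDS = {
    "name": "td[4]/text()",
    "symbol": "td/a/b/text()",
    "current": "td[6]/text()",
}

# Made-up sample row in which the "symbol" cell is missing.
row = FakeRow({"td[4]/text()": ["Example Corp"], "td[6]/text()": ["23.10"]})

item = {field: extractor(row, xpath, default="") for field, xpath in FIELDS.items()}
print(item)
```

With a real selector, the same `extractor` would be called as `extractor(cotacao, "td[4]/text()", default="")`; the mapping just keeps the field list in one place.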