Tags: parsing, html-table, scrapy, html-parsing

Scrapy - Trouble with <TD> parsing alignment


I'm trying to parse data from only the Item and Skill Cap columns of the HTML table here: http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html

When parsing, I run into alignment issues: my script picks up values from other columns.

import scrapy

class parser(scrapy.Spider):
    name = "recipe_table"
    start_urls = ['http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html'] 

    def parse(self, response):
        for row in response.xpath('//*[@class="datatable sortable"]//tr'):
            data = row.xpath('td//text()').extract()
            if not data:  # skip empty row
                continue
            yield {
                'name': data[0],
                'cap': data[1],
                # 'misc': data[2],
            }

Results from running scrapy runspider cap.py -t json: when it reaches the 3rd row, data from an unintended column is parsed. I'm not sure what's going on with the selection.

2019-05-09 19:41:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html> (referer: None)
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Banquet Set', 'cap': u'0'}
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Banquet Table', 'cap': u'0'}
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Cermet Kilij', 'cap': u'Cermet Kilij +1'}

Solution

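  • What about explicitly setting the source column with XPath? td//text() returns every text node in the row as one flat list, so a cell that holds more than one text node (or none at all) shifts the indexes of everything after it. Selecting each cell by position avoids that: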

    for row in response.xpath('//*[@class="datatable sortable"]//tr'):
        yield {
            'name': row.xpath('./td[1]/text()').extract_first(),
            'cap': row.xpath('./td[3]/text()').extract_first(),
            # 'misc': etc.
        }
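
To see why the flat list drifts while per-cell selection does not, here is a small self-contained sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, so it runs without a crawl. The table markup is hypothetical: I'm assuming an item cell, an icon cell with no text, and a Skill Cap cell, and that the third row lists two item names in one cell. The real page may be laid out differently. The helper names (flattened, first_text, by_position) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, NOT copied from the real page: column 2 has no text,
# and the last row's first cell contains two item names.
HTML = """
<table class="datatable sortable">
  <tr><th>Item</th><th>Icon</th><th>Skill Cap</th></tr>
  <tr><td><a>Banquet Set</a></td><td><img/></td><td>0</td></tr>
  <tr><td><a>Cermet Kilij</a> <a>Cermet Kilij +1</a></td><td><img/></td><td>70</td></tr>
</table>
"""


def flattened(row):
    """Roughly mimic row.xpath('td//text()'): one flat list of text per row."""
    return [t.strip() for td in row.findall("td")
            for t in td.itertext() if t.strip()]


def first_text(cell):
    """Return the first non-blank text node inside a single cell, or None."""
    return next((t.strip() for t in cell.itertext() if t.strip()), None)


def by_position(row):
    """Read each value from its own cell, so a busy cell cannot shift indexes."""
    cells = row.findall("td")
    return {"name": first_text(cells[0]), "cap": first_text(cells[2])}


table = ET.fromstring(HTML.strip())
# Skip the header row, which has <th> cells and no <td> cells.
rows = [r for r in table.findall("tr") if r.find("td") is not None]

flat = [flattened(r) for r in rows]
items = [by_position(r) for r in rows]

print(flat[1])   # second data row: index 1 is the second item name, not the cap
print(items[1])  # per-cell lookup keeps name and cap aligned
```

In the spider itself, the per-cell equivalent would be along the lines of row.xpath('./td[1]//text()').extract_first(), which also reaches text nested inside a link, whereas ./td[1]/text() only sees text directly inside the cell. Which column index holds the Skill Cap on the real page is something to confirm against the page source.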