
Simple way to change scrapy .getall() delimiter


I'm running a basic scrapy crawler and I can't find anything in the scrapy documentation on how to change the delimiter of a .getall() result. The default appears to be comma-separated, and I assume this could cause errors when the data is imported elsewhere.

Ideally, I want the exported csv to stay comma-separated while the getall() data inside a field is pipe- or semicolon-separated. I would prefer to fix this efficiently within the scrapy script. For example, say the bit containing the .getall() is

def entry_parse(self, response):
    for entry in response.xpath("//tbody[@class='entry-grid-body infinite']//td[@class]"):
        yield {'entry_labels': entry.xpath(".//div[@class='entry-labels']/span/text()").getall()}

It would be nice to be able to pass such an argument into getall() or something similar, but I can't seem to find any documentation allowing that. Any ideas would be helpful! Thanks.


Solution

  • This is not really a scrapy problem. The .getall() method returns a list, and the repr of a list separates its items with commas by default:

    >>> repr(["a","b"])
    "['a', 'b']"
    

    To change the delimiter, serialize the list with json.dumps before yielding the item and pass its separators argument, an (item_separator, key_separator) tuple:

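    import json

    def entry_parse(self, response):
        for entry in response.xpath("//tbody[@class='entry-grid-body infinite']//td[@class]"):
            # .getall() returns a list of strings, one per matched text node
            labels = entry.xpath(".//div[@class='entry-labels']/span/text()").getall()
            yield {
                # Serialize the list with "|" between items instead of ", "
                'entry_labels': json.dumps(labels, separators=("|", ":")),
            }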
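
To see what the separators argument actually produces, here is a minimal sketch outside of scrapy. The labels list is a made-up stand-in for what .getall() might return, and the str.join line is an alternative not mentioned above, shown only for comparison: json.dumps keeps the JSON brackets and quotes around the items, while join gives a bare delimited string.

```python
import json

# Hypothetical stand-in for the list of strings returned by .getall()
labels = ["a", "b c", "d"]

# json.dumps keeps the JSON array syntax; only the item separator changes
as_json = json.dumps(labels, separators=("|", ":"))

# str.join (an alternative, not part of the answer above) drops brackets and quotes
as_joined = "|".join(labels)

print(as_json)    # ["a"|"b c"|"d"]
print(as_joined)  # a|b c|d
```

Which form is preferable probably depends on what consumes the csv: the json.dumps output is no longer valid JSON once the separator is changed, whereas the joined string can be split on "|" directly, provided no label itself contains a pipe.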