Arranging one item per column in a row of a CSV file in Scrapy (Python)


I scraped items from a site and stored them in JSON files, like this:

{
 "author": ["TIM ROCK"], 
 "book_name": ["Truk Lagoon, Pohnpei & Kosrae Dive Guide"], 
 "category": "Travel", 
}
{
 "author": ["JOY"], 
 "book_name": ["PARSER"], 
 "category": "Accomp", 
}

I want to store them in a CSV file, with one dictionary per row and one item per column, like this:

|    author   |     book_name     |    category   |
|   TIM ROCK  |  Truk Lagoon ...  |     Travel    |
|     JOY     |   PARSER          |     Accomp    |

Instead, I am getting each dictionary's items in one row, but with all the columns combined.

My pipeline.py code is:

import csv

class Blurb2Pipeline(object):

    def __init__(self):
        self.brandCategoryCsv = csv.writer(open('blurb.csv', 'wb'))
        self.brandCategoryCsv.writerow(['book_name', 'author','category'])

    def process_item(self, item, spider):
        self.brandCategoryCsv.writerow([item['book_name'].encode('utf-8'),
                                    item['author'].encode('utf-8'),
                                    item['category'].encode('utf-8'),
                                     ])
        return item        

Solution

  • The gist: this is straightforward with csv.DictWriter:

    >>> inputs = [{
    ...  "author": ["TIM ROCK"], 
    ...  "book_name": ["Truk Lagoon, Pohnpei & Kosrae Dive Guide"], 
    ...  "category": "Travel", 
    ... },
    ... {
    ...  "author": ["JOY"], 
    ...  "book_name": ["PARSER"], 
    ...  "category": "Accomp", 
    ... }
    ... ]
    >>> 
    >>> from csv import DictWriter
    >>> from cStringIO import StringIO
    >>> 
    >>> buf=StringIO()
    >>> c=DictWriter(buf, fieldnames=['author', 'book_name', 'category'])
    >>> c.writeheader()
    >>> c.writerows(inputs)
    >>> print buf.getvalue()
    author,book_name,category
    ['TIM ROCK'],"['Truk Lagoon, Pohnpei & Kosrae Dive Guide']",Travel
    ['JOY'],['PARSER'],Accomp
    

    It would be better to join those lists into a single string, but since a value can be either a list or a string, it's a bit tricky. Telling a string apart from some other iterable is one of the few cases in Python where direct type-checking makes good sense.

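    >>> buf = StringIO()  # fresh buffer, so the rows written earlier are not repeated
    >>> c = DictWriter(buf, fieldnames=['author', 'book_name', 'category'])
    >>> c.writeheader()
    >>> for row in inputs: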
    ...     for k, v in row.iteritems():
    ...         if not isinstance(v, basestring):
    ...             try:
    ...                 row[k] = ', '.join(v)
    ...             except TypeError:
    ...                 pass
    ...     c.writerow(row)
    ... 
    >>> print buf.getvalue()
    author,book_name,category
    TIM ROCK,"Truk Lagoon, Pohnpei & Kosrae Dive Guide",Travel
    JOY,PARSER,Accomp
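
The session above is Python 2 (cStringIO, iteritems, basestring). On Python 3, the same idea applied inside a Scrapy item pipeline might look roughly like the sketch below. The class name CsvExportPipeline, the helper flatten_value, and the path argument are illustrative names, not part of the original code; open_spider, process_item and close_spider are the hook methods Scrapy calls on a pipeline. It assumes every item carries the three fields from the question.

```python
import csv


def flatten_value(value, sep=", "):
    """Return strings unchanged; join a list or tuple of strings into one string."""
    if isinstance(value, str):
        return value
    try:
        return sep.join(value)
    except TypeError:
        # Not an iterable of strings (for example a number or None): leave as-is.
        return value


class CsvExportPipeline:
    """Hypothetical pipeline: one item per row, one field per column."""

    fieldnames = ["author", "book_name", "category"]

    def __init__(self, path="blurb.csv"):
        self.path = path

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings on Python 3.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=self.fieldnames)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Build a plain dict so the scraped item itself is not modified.
        row = {name: flatten_value(item.get(name, "")) for name in self.fieldnames}
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        self.file.close()
```

Opening the file in open_spider and closing it in close_spider, rather than in __init__, should ensure the buffered rows are flushed to disk when the crawl ends.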