
Passing scraped data to a pipeline's __init__ in Scrapy (Python)


I am trying to pass the items that contain the title data to my pipelines. Is there a way to do this inside parse, since the data gets reset for the next page? I tried super(mySpider, self).__init__(*args, **kwargs), but the data is not sent correctly. I need to use the title of the webpage as the filename, which is why I need that specific item in the pipeline.

Something like this:

    def __init__(self, item):
        self.csvwriter = csv.writer(open(item['title'][0] + '.csv', 'wb'), delimiter=',')
        self.csvwriter.writerow(['Name', 'Date', 'Location', 'Stars',
                                 'Subject', 'Comment', 'Response', 'Title'])

Solution

  • The input to any pipeline is your Item. In your case, you need to pass the name (or any other data) as a field of your Item, and then write a pipeline that saves that item to the file system (or a database, or whatever else you want).
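
    For instance, a minimal Item carrying the title could look like this (a sketch; the class and field names are illustrative, not from the original question):

    import scrapy

    class ReviewItem(scrapy.Item):
        title = scrapy.Field()  # the pipeline uses this to build the filename
        name = scrapy.Field()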

    Sample code

    Let's say your new pipeline is named NewPipeline and lives in a module in the root of your Scrapy project.

    In your settings.py, you need to enable that pipeline like this:

    ITEM_PIPELINES = {
        'YourRootDirectory.NewPipeline.NewPipeline': 800,
        # add any other pipelines you have
    }
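
    The number (800 here) is the pipeline's order: Scrapy runs enabled pipelines in ascending order of these values, which conventionally range from 0 to 1000.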
    

    And your pipeline should be like this:

    import json

    class NewPipeline(object):
        def process_item(self, item, spider):
            name = item['name']
            line = json.dumps(dict(item))  # serialise the item to JSON on one line
            with open("pathToWhereYouWantToSave" + name, 'w') as f:
                f.write(line)  # write the item to its own file
            return item  # pass the item on to any later pipelines
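
    In the spider, you then populate and yield the item from parse() instead of trying to pass it to the pipeline's __init__ (a rough sketch using the illustrative ReviewItem above; the CSS selector is a placeholder):

    def parse(self, response):
        item = ReviewItem()
        item['title'] = response.xpath('//title/text()').get()
        item['name'] = response.css('.reviewer::text').get()
        yield item  # Scrapy hands every yielded item to process_item()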
    

    Note

    You can put your pipeline in any other module; just adjust the dotted path in ITEM_PIPELINES accordingly.
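
    Finally, since the original goal was one CSV file named after the page title, here is a minimal sketch of such a pipeline (untested and illustrative; it assumes every item carries a 'title' field plus the columns from the question):

    import csv

    class CsvPerTitlePipeline(object):
        def open_spider(self, spider):
            self.files = {}    # title -> open file handle
            self.writers = {}  # title -> csv.writer

        def process_item(self, item, spider):
            title = item['title']
            if title not in self.writers:
                # first time we see this title: open its file and write the header
                f = open(title + '.csv', 'w', newline='')
                writer = csv.writer(f)
                writer.writerow(['Name', 'Date', 'Location', 'Stars',
                                 'Subject', 'Comment', 'Response', 'Title'])
                self.files[title] = f
                self.writers[title] = writer
            self.writers[title].writerow([item.get(field) for field in
                ('name', 'date', 'location', 'stars',
                 'subject', 'comment', 'response', 'title')])
            return item

        def close_spider(self, spider):
            for f in self.files.values():
                f.close()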