Tags: request, scrapy, response, meta, deferred

Scrapy: continue processing the result from a parse function


I am trying to parse page A, download the files listed on it to local disk, replace the URLs in page A with the paths of the files I saved, and finally save page A to local disk.

I tried the files pipeline, but it does not work. The URLs in page A look like http:...php?id=1234, so the built-in file_path() returns an error. Overriding file_path() just stops the pipeline from working, without any debug output.

So I found this post:

Answer I referred to

After applying it, I found that the callback does not change the data I passed in meta. My code looks like this:

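def ParseClientCaseNote(self, response):
    # Download all attachments and replace the URLs in the page with paths to the local files.
    TestMeta = 'this is to test meta argu'
    # AttachmentList is built elsewhere; it holds the attachment URLs found on the page.
    for a in AttachmentList:
        yield scrapy.Request(a, callback=self.DownClientCaseNoteAttach, meta={'test': TestMeta})

    # This line runs once the loop is exhausted, before any download callback has been called.
    self.logger.info('ParseClientCaseNote: after call DownClientCaseNoteAttach, testmeta is: ' + TestMeta)

def DownClientCaseNoteAttach(self, response):
    TestArg = response.meta['test']
    self.logger.info('DownClientCaseNoteAttach: test meta')
    self.logger.info(TestArg)
    # Rebinding the local name changes neither response.meta nor TestMeta in the caller.
    TestArg = 'this is revised from DownClientCaseNoteAttach'

    # AbsPath is computed elsewhere; it is the local path for this attachment.
    with open(AbsPath, 'wb') as f:
        f.write(response.body)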

I got the following result in the log:

2018-09-29 09:26:13 [debug] INFO: ParseClientCaseNote: after call DownClientCaseNoteAttach, testmeta is: this is to test meta argu
2018-09-29 09:26:17 [debug] INFO: DownClientCaseNoteAttach: test meta
2018-09-29 09:26:17 [debug] INFO: this is to test meta argu

It seems the download callback is deferred: it only runs after the parse function has already finished. How can I get its result back correctly?

Thanks
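
The log order in the question is consistent with requests being scheduled rather than executed inline: the code after the loop in the parse callback runs before any download callback does. The sketch below imitates that ordering with a toy queue. It is not Scrapy's engine, and parse_page, download_attachment, run, and the example.com URLs are hypothetical names chosen for illustration. It also shows why reassigning a local variable in the callback is invisible to the caller, while mutating a shared object becomes visible once the callbacks have run.

```python
from collections import deque


def parse_page(attachment_urls, saved, log):
    """Toy stand-in for the parse callback: yields one 'request' per attachment."""
    label = "original"
    for url in attachment_urls:
        # Each tuple plays the role of a request: (url, callback, meta).
        yield (url, download_attachment, {"label": label, "saved": saved})
    # Reached when the loop is exhausted, before any callback has run.
    log.append(("parse_page finished", label, dict(saved)))


def download_attachment(url, meta, log):
    """Toy stand-in for the download callback."""
    label = meta["label"]
    label = "revised"  # rebinds the local name only; the caller never sees this
    # Mutating the shared dict IS visible to whoever holds a reference to it.
    meta["saved"][url] = "local/" + url.rsplit("/", 1)[-1]
    log.append(("download_attachment ran", url))


def run(attachment_urls):
    """Toy scheduler (an assumption, not Scrapy): queue all requests, then call back."""
    log, saved = [], {}
    pending = deque(parse_page(attachment_urls, saved, log))
    while pending:
        url, callback, meta = pending.popleft()
        callback(url, meta, log)
    return log, saved


if __name__ == "__main__":
    events, saved_files = run(["http://example.com/a.pdf", "http://example.com/b.pdf"])
    for event in events:
        print(event)
    print(saved_files)
```

Under these assumptions, any step that depends on all downloads having finished, such as saving the rewritten page A, cannot sit right after the loop in the parse function; it has to run in a later callback.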


Solution

  • I used a workaround to address this. While parsing page A, I read each file's name from the page, pass that name to my own download function, and change the URL in page A to point to a local file with that name. In the download function, I check that the file name in response.headers['Content-Disposition'].decode(response.headers.encoding) is the same as the one I found on page A before saving the file.
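
The two pieces of this workaround that do not depend on Scrapy can be sketched as plain functions. This is a minimal sketch, not the poster's actual code: the helper names filename_from_content_disposition, names_match, and rewrite_links, and the example.com URLs in the usage lines, are assumptions. In a download callback, the header value would be the decoded Content-Disposition string described above.

```python
from email.message import Message
from typing import Dict, Optional


def filename_from_content_disposition(header_value: str) -> Optional[str]:
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    # The standard library parses the header parameters and unquotes the value.
    return msg.get_filename()


def names_match(expected_name: str, header_value: str) -> bool:
    """True if the server-reported file name equals the name found on page A."""
    return filename_from_content_disposition(header_value) == expected_name


def rewrite_links(page_html: str, url_to_local: Dict[str, str]) -> str:
    """Replace each remote URL in the page text with its local path."""
    # Longest URLs first, so a URL that is a prefix of another one
    # (e.g. ...id=12 and ...id=1234) does not clobber the longer match.
    for url in sorted(url_to_local, key=len, reverse=True):
        page_html = page_html.replace(url, url_to_local[url])
    return page_html


if __name__ == "__main__":
    header = 'attachment; filename="note_1234.pdf"'
    print(filename_from_content_disposition(header))
    print(names_match("note_1234.pdf", header))
    page = '<a href="http://example.com/get.php?id=1234">note</a>'
    print(rewrite_links(page, {"http://example.com/get.php?id=1234": "files/note_1234.pdf"}))
```

Plain string replacement is used here for brevity; if the page stores URLs in an escaped form (for example with &amp; in query strings), the keys of the mapping would need to match that form.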