Tags: python, scrapy, pypdf

How to use Scrapy to parse PDF pages online?


I tried using Scrapy with the PyPDF2 library to crawl PDFs online, without success. So far I can follow all the links and fetch the PDF files, but feeding them to PyPDF2 fails.

Note: my goal is not to download or save the PDF files. I want to parse them by first converting each PDF to text and then processing that text with other methods.

For brevity, I did not include the entire code here. Here's part of my code:

import io
import re
import PyPDF2
import scrapy
from scrapy.item import Item

class ArticleSpider(scrapy.Spider):
    name = "spyder_ARTICLE"
    start_urls = ['https://legion-216909.appspot.com/content.htm']

    def parse(self, response):
        for article_url in response.xpath('//div//a/@href').extract():
            yield response.follow(article_url, callback=self.parse_pdf)

    def parse_pdf(self, response):
        """ Peek inside PDF to check for targets.
        @return: PDF content as searchable plain-text string
        """
        reader = PyPDF2.PdfFileReader(response.body)
        text = u""

        # Title is optional, may be None
        if reader.getDocumentInfo().title: text += reader.getDocumentInfo().title
        # XXX: Does this handle Unicode properly?
        for page in reader.pages: text += page.extractText()

        return text

Each time I run the code, the spider reaches the line reader = PyPDF2.PdfFileReader(response.body) and fails with the following error: AttributeError: 'bytes' object has no attribute 'seek'

What am I doing wrong?


Solution

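  • This is not a Scrapy problem. PyPDF2.PdfFileReader expects a file-like object (a binary stream that supports read and seek), but response.body is a plain bytes object, which has no seek method. Wrap the bytes in io.BytesIO to get an in-memory stream: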

    # use this instead of passing response.body directly into PyPDF2
    reader = PyPDF2.PdfFileReader(io.BytesIO(response.body))
    

    Hope this helps.
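
To see why the wrapper is needed, here is a minimal sketch that uses only the standard library. The fake_body value is a hypothetical stand-in for response.body (it is not a valid PDF, just some bytes), and the variable names are illustrative. It shows that a bytes object has no seek method, while io.BytesIO exposes the same data as a seekable binary stream, which is the kind of object the reader asks for.

```python
import io

# Hypothetical stand-in for response.body; not a valid PDF, just raw bytes.
fake_body = b"%PDF-1.4 example payload"

# Plain bytes are not file-like: there is no seek(), hence the AttributeError.
bytes_has_seek = hasattr(fake_body, "seek")

# io.BytesIO wraps the same bytes in an in-memory binary stream.
stream = io.BytesIO(fake_body)
stream_is_seekable = stream.seekable()

# The stream supports the seek/tell/read calls a file-based parser relies on.
stream.seek(0, io.SEEK_END)   # move to the end of the data
size = stream.tell()          # current position equals the total length
stream.seek(0)                # rewind to the start
header = stream.read(5)       # read the first five bytes

print(bytes_has_seek, stream_is_seekable, size, header)
```

Nothing is written to disk here: the stream lives in memory, so this fits the goal of parsing the PDFs without saving them.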