
Opening and preprocessing text (300 PDFs) in Python


I need to preprocess a folder of PDFs: remove punctuation, convert everything to lower case, remove stopwords, and attach extra data from a separate CSV file as metadata. However, I cannot even open the files. Searching has not helped, because I do not understand the error message, and the examples I found from other people involve different data types.

This is my code so far:

import PyPDF2
import re

for k in range(1,312):
    # open the pdf file
    object = PyPDF2.PdfFileReader("/Users/n_n/Desktop/Digitalization/reserve" % (k))
    

And this is what happens:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [37], in <cell line: 4>()
      2 import re
      4 for k in range(1,312):
      5     # open the pdf file
----> 6     object = PyPDF2.PdfFileReader("/Users/n_n/Desktop/Digitalization/reserve" % (k))

TypeError: not all arguments converted during string formatting

Solution

  • The path string is missing a format placeholder for k:

    object = PyPDF2.PdfFileReader("/Users/n_n/Desktop/Digitalization/reserve%s" % k)
    

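    Note the "%s" at the end of the file path string. When formatting with the % operator, each "%s" is replaced by the string form of the corresponding argument, which in this case is str(k). Without a placeholder, Python has nowhere to put k and raises "TypeError: not all arguments converted during string formatting".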
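
    The behaviour can be reproduced without any PDF files. The following is a minimal sketch, not the asker's actual setup: BASE is a shortened, hypothetical stand-in for the real path, and the helper names build_path, build_path_fstring and broken_path are invented for illustration. It shows the placeholder version, an equivalent f-string, and the call that triggers the TypeError.

```python
# Hypothetical, shortened stand-in for the real folder path (assumption).
BASE = "reserve"


def build_path(base, k):
    """Append k to base using the % operator with "%s" placeholders."""
    return "%s%s" % (base, k)


def build_path_fstring(base, k):
    """Same result with an f-string (Python 3.6+)."""
    return f"{base}{k}"


def broken_path(base, k):
    """Reproduce the bug: base has no placeholder, so % cannot consume k."""
    return base % k


# Build the first few paths the loop would produce.
paths = [build_path(BASE, k) for k in range(1, 4)]
print(paths)

# The original call fails because the string contains no "%s".
try:
    broken_path(BASE, 1)
except TypeError as exc:
    print("TypeError:", exc)
```

    Either working helper could supply the string passed to PyPDF2.PdfFileReader inside the loop. The f-string form is often easier to read, but both produce the same path.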