Tags: python, pandas, vectorization, coding-efficiency, pdftotext

Speed Up Python Function that Extracts Text from PDF


I am currently working on a program that scrapes text from tens of thousands of PDFs of court opinions. I am relatively new to Python and am trying to make this code as efficient as possible. I have gathered from many posts on this site and elsewhere that I should be trying to vectorize my code, but I have tried three methods for doing so without success.

My reprex uses these packages and this sample data.

import os
import pandas as pd
import pdftotext
import wget

df = pd.DataFrame({'OpinionText': [""], 'URLs': ["https://cases.justia.com/federal/appellate-courts/ca6/20-6226/20-6226-2021-09-17.pdf?ts=1631908842"]})
df = pd.concat([df]*50, ignore_index=True)

I started by defining this function, which downloads the PDF, extracts the text, deletes the PDF, and then returns the text.

def Link2Text(Link):
    OpinionPDF = wget.download(Link, "Temporary_Opinion.pdf")
    with open(OpinionPDF, "rb") as f:
        pdf = pdftotext.PDF(f)
    OpinionText = "\n\n".join(pdf)
    if os.path.exists("Temporary_Opinion.pdf"):
        os.remove("Temporary_Opinion.pdf")
    return OpinionText

The first way that I called the function, which works but is very slow, is:

df['OpinionText'] = df['URLs'].apply(Link2Text)
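
For reference, .apply is not vectorized here; it simply calls Link2Text once per row, so it behaves like the plain loop below (same Link2Text and df as above).

# Equivalent element-wise loop: .apply offers no vectorization benefit for
# an I/O-bound function like Link2Text.
df['OpinionText'] = [Link2Text(url) for url in df['URLs']]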

Based on what I read about vectorization, I tried calling the function using:

df['OpinionText'] = Link2Text(df['URLs'])

#and, alternatively:

df['OpinionText'] = Link2Text(df['URLs'].values)

Both of these returned the same error, which is:

Traceback (most recent call last):
  File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 22, in <module>
    df['OpinionText'] = Link2Text(df['URLs'])
  File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 10, in Link2Text
    OpinionPDF = wget.download(Link, "Temporary_Opinion.pdf")
  File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 505, in download
    prefix = detect_filename(url, out)
  File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 483, in detect_filename
    if url:
  File "/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 1442, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
[Finished in 0.683s]

I gather that this is saying that wget does not know what to do with the input because it received the entire Series rather than a single URL string, so I tried replacing the call with the one below and got this traceback.

df['OpinionText'] = Link2Text(df['URLs'].item)

Traceback (most recent call last):
  File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 22, in <module>
    df['OpinionText'] = Link2Text(df['URLs'].item)
  File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 10, in Link2Text
    OpinionPDF = wget.download(Link, "Temporary_Opinion.pdf")
  File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 505, in download
    prefix = detect_filename(url, out)
  File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 484, in detect_filename
    names["url"] = filename_from_url(url) or ''
  File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 230, in filename_from_url
    fname = os.path.basename(urlparse.urlparse(url).path)
  File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 372, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 124, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 108, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 108, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'function' object has no attribute 'decode'

I tried adding .decode('utf-8') both to the argument in my function call and to the input inside the function, but got the same traceback in both cases. At this point, I do not know what else to try to speed up my code.

I also tried numpy.vectorize with the version that works using .apply, but it dramatically slowed down the execution. I am assuming that those two should not be used together.
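
For reference, a sketch of what that attempt presumably looked like (the exact call is an assumption); np.vectorize is just a convenience wrapper around a Python loop, so it adds per-call overhead instead of removing any.

import numpy as np

# Presumed form of the numpy.vectorize + .apply combination. np.vectorize
# loops in Python internally (and makes an extra trial call to infer the
# output type), so it only adds overhead on top of the download and parse.
df['OpinionText'] = df['URLs'].apply(np.vectorize(Link2Text))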

In the interest of completeness, based on some excellent answers here, I also tried:

from numba import njit
import numpy as np

@njit
def Link2Text(Link, Opinion):
    res = np.empty(Link.shape)
    for i in range(len(Link)):
        OpinionPDF = wget.download(Link[i], "Temporary_Opinion.pdf")
        with open(OpinionPDF, "rb") as f:
            pdf = pdftotext.PDF(f)
        OpinionText = "\n\n".join(pdf)
        if os.path.exists("Temporary_Opinion.pdf"):
            os.remove("Temporary_Opinion.pdf")
        Opinion[i] = OpinionText

Link2Text(df['URLs'].values, df['OpinionText'].values)

I gather that this did not work because numba does not support the packages I am calling inside the function and is intended more for numerical operations. If that is not correct and I should be trying to use numba for this, please let me know.


Solution

  • I took the advice in the comments: I dropped pandas in favor of a list comprehension and rewrote the code as:

    import io
    import os
    import subprocess as sp

    import wget

    def pdftotext(path):
        # Run the pdftotext command-line tool: -layout preserves the page
        # layout and -q suppresses error messages.
        args = ['pdftotext', '-layout', '-q', path, 'Opinion_Text.txt']
        cp = sp.run(args, stdout=sp.PIPE, stderr=sp.DEVNULL, check=True, text=True)
        return cp.stdout

    def Link2Text(Link):
        OpinionPDF = wget.download(Link, "Temporary_Opinion.pdf")
        pdftotext(OpinionPDF)
        with io.open("Opinion_Text.txt", mode="r", encoding="utf-8") as f:
            OpinionText = f.readlines()
        if os.path.exists("Temporary_Opinion.pdf"):
            os.remove("Temporary_Opinion.pdf")
        if os.path.exists("Opinion_Text.txt"):
            os.remove("Opinion_Text.txt")
        return OpinionText

    Opinions = [Link2Text(item) for item in URLs]
    
    

    This is considerably faster and does exactly what I need. Thanks to everyone who offered advice on this! The next step will be using threading and layout analysis to speed up the IO and clean the data.
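
    A minimal sketch of that threading step, assuming the same wget + pdftotext CLI approach as above; each task writes to its own temporary files so concurrent downloads do not overwrite each other (the helper name, per-task file names, and worker count are illustrative).

    import os
    import subprocess as sp
    from concurrent.futures import ThreadPoolExecutor

    import wget

    def link_to_text_threaded(link, idx):
        # Hypothetical per-task file names so threads do not collide on the
        # shared Temporary_Opinion.pdf / Opinion_Text.txt names used above.
        pdf_path = f"Temporary_Opinion_{idx}.pdf"
        txt_path = f"Opinion_Text_{idx}.txt"
        wget.download(link, pdf_path, bar=None)  # bar=None silences the progress bar
        sp.run(['pdftotext', '-layout', '-q', pdf_path, txt_path], check=True)
        with open(txt_path, encoding="utf-8") as f:
            text = f.readlines()
        for path in (pdf_path, txt_path):
            if os.path.exists(path):
                os.remove(path)
        return text

    # Downloads are network-bound, so a small thread pool overlaps the waiting.
    with ThreadPoolExecutor(max_workers=8) as pool:
        Opinions = list(pool.map(link_to_text_threaded, URLs, range(len(URLs))))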