Tags: python, pdfminer, pdfplumber

How to optimize (also RAM-wise) code that saves words from a PDF into a Python object and later into a database?


I am looking for the most efficient way of saving text from PDF files into my database. Currently I am using pdfplumber with standard code that looks like this:

import pdfplumber

my_string = ''

with pdfplumber.open(text_file_path) as pdf:
    for page in pdf.pages:
        # Extract once per page instead of calling extract_text() twice
        text = page.extract_text()
        if text:
            my_string += str(text.replace('\n', ' ').split(' '))

But the current code is literally killing my machine (it takes around 3 to 6 GB of RAM for a 600-page PDF), and my goal is to eventually run this on mobile phones.

I did some tests and it seems that reading the PDF is not the problem; saving or storing those words is. I tried creating a dict where each page's string is one key/value pair, but it wasn't much better.

Maybe I should try yielding each page into a txt file and then just reading the string back from that txt file?
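
Roughly what I mean, as a sketch (temp_pages.txt is just a placeholder name):

import pdfplumber

def iter_pages(text_file_path):
    # Yield one page's text at a time instead of accumulating everything in one string
    with pdfplumber.open(text_file_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                yield text.replace('\n', ' ')

with open('temp_pages.txt', 'w', encoding='utf-8') as out:
    for page_text in iter_pages(text_file_path):
        out.write(page_text + '\n')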

I will be grateful for any tips, thanks!

EDIT:

    import sqlite3
    import pdfplumber

    with pdfplumber.open(text_file_path) as pdf:
        for page in pdf.pages:
            # A new connection is opened and closed for every single page
            connection = sqlite3.connect('my_db.db')
            cursor = connection.cursor()
            cursor.execute("INSERT INTO temp_text VALUES (?, ?)",
                           (text_file_path, str(page.extract_text()).replace('\n', ' ')))
            connection.commit()
            connection.close()

I changed the code to that, and it is a little bit better (now it takes up to around 2.9 GB of RAM), but it is still a lot. Can I do anything more about it?


Solution

  • The issue is that you're storing data long-term: as you incrementally process more and more data, you keep a reference to all of it in memory. This is exactly what a database is meant to prevent: efficient storage and retrieval of data without needing to hold it all in RAM. A simple example using PyMongo is below (for an iOS app you're likely going to want SQLite instead; a rough SQLite sketch is at the end of this answer):

    import pdfplumber
    import pymongo
    import os
    
    
    def process_file(path, collection):
        '''Process a single file, from a path.'''
    
        basename = os.path.splitext(os.path.basename(path))[0]
        with pdfplumber.open(path) as pdf:
            for index, page in enumerate(pdf.pages):
                # Don't store any long-term references to the data
                text = page.extract_text()
                data = { 'text': text, 'filename': basename, 'page': index }
                collection.insert_one(data)
    
    def main(paths):
        '''Just a dummy entry point, pass args normally here.'''
    
        client = pymongo.MongoClient('localhost', 27017)
        database = client['myapp']
        collection = database['pdfs']
        # Sort by filename, then by page.
        collection.create_index([('filename', 1), ('page', 1)])
        for path in paths:
            process_file(path, collection)
    
        # Do what you want here
    

    As you can see, we create a client connected to the local MongoDB server, create or access the database we're using, and create a collection for storing the PDF text. We then index by filename, then by page number.

    We then iterate over all the paths and process them one at a time. We never keep more than a single page's text in memory, and we write to the database on every iteration. For write performance this might not be optimal (although the engine may batch things decently anyway), but it minimizes the memory required.
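
    If the insert-per-page pattern ever becomes a write bottleneck, one option is to buffer a handful of pages and flush them with insert_many. This is only a sketch of that idea; the batch size of 50 is an arbitrary assumption:

    import os

    import pdfplumber


    def process_file_batched(path, collection, batch_size=50):
        '''Sketch: insert a small batch of pages at a time instead of one insert per page.'''

        basename = os.path.splitext(os.path.basename(path))[0]
        batch = []
        with pdfplumber.open(path) as pdf:
            for index, page in enumerate(pdf.pages):
                text = page.extract_text()
                batch.append({'text': text, 'filename': basename, 'page': index})
                if len(batch) >= batch_size:
                    collection.insert_many(batch)
                    batch = []  # Drop references so the buffered pages can be freed
        if batch:
            collection.insert_many(batch)

    Memory use rises a little (up to batch_size pages at a time), which is the usual trade-off against fewer round-trips to the database.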

    Avoid global state that accumulates many gigabytes of data: it forces Python to keep a reference to all of that data when it doesn't need to.
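
    Since the goal is to run this on a phone, here is a rough SQLite equivalent of the same per-page pattern. It is only a sketch: the pdf_pages table name and schema are assumptions, and unlike the edit in the question the connection is opened once and committed per file rather than per page.

    import os
    import sqlite3

    import pdfplumber


    def process_file_sqlite(path, connection):
        '''Sketch: stream one page at a time into SQLite without keeping the whole PDF in memory.'''

        basename = os.path.splitext(os.path.basename(path))[0]
        cursor = connection.cursor()
        with pdfplumber.open(path) as pdf:
            for index, page in enumerate(pdf.pages):
                text = page.extract_text() or ''
                cursor.execute(
                    "INSERT INTO pdf_pages (filename, page, text) VALUES (?, ?, ?)",
                    (basename, index, text.replace('\n', ' ')))
        connection.commit()  # One commit per file instead of one per page


    connection = sqlite3.connect('my_db.db')
    connection.execute(
        "CREATE TABLE IF NOT EXISTS pdf_pages (filename TEXT, page INTEGER, text TEXT)")
    for path in paths:  # paths is whatever list of PDF files you are processing
        process_file_sqlite(path, connection)
    connection.close()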