
If statement based on value existing in jsonlines file


I have code that pulls over 400 PDFs off a website via Beautiful Soup. PyPDF2 converts each PDF to text, which is then saved to a jsonlines file called 'output.jsonl'.
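
In outline, the conversion step looks roughly like this (a simplified sketch rather than my exact code, assuming a recent PyPDF2 where PdfReader/extract_text are available; the title and url values are placeholders):

import os
import jsonlines
from PyPDF2 import PdfReader

def pdf_to_record(pdf_path):
    # Extract the text of every page and build one jsonlines record.
    reader = PdfReader(pdf_path)
    text = ''.join(page.extract_text() or '' for page in reader.pages)
    doc_id = os.path.splitext(os.path.basename(pdf_path))[0]  # e.g. '1234'
    return {'id': doc_id, 'title': 'Transcript',
            'url': 'www.stackoverflow.com', 'text': text}

with jsonlines.open('C:/Users/.../PDFs/output.jsonl', 'a') as writer:
    writer.write(pdf_to_record('C:/Users/.../PDFs/1234.pdf'))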

When I save new PDFs in future updates, I want PyPDF2 to convert only the new PDFs to text and append that new text to the jsonlines file, which is where I am struggling.

The jsonlines file looks like this:

{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}...

The PDFs are named "1234", "1235", etc. and are saved in file_path_PDFs. I am trying to check whether that "id" already exists as a value in the jsonlines file; if it does, there is no need for PyPDF2 to convert that PDF to text. If it does not exist, process it as usual.

file_path_PDFs = 'C:/Users/.../PDFs/'
json_list = []

for filename in os.listdir(file_path_PDFs):   
    if os.path.exists('C:/Users/.../PDFs/output.jsonl'):
        with jsonlines.open('C:/Users/.../PDFs/output.jsonl') as reader:
            mytext = jsonlines.Reader.iter(reader)
            for obj in mytext:
                if filename[:-4] in mytext: #filename[:-4] removes .pdf from string
                    continue
                else:
                    ~convert to text~

with jsonlines.open('C:/Users/.../PDFs/output.jsonl', 'a') as writer:
    writer.write_all(json_list)

As it stands, I believe this code is not finding any of the values and is converting ALL of the text each time I run it. Obviously this is quite a lengthy process, with each document spanning 200 or 300 pages.


Solution

  • Updates:

    • Optimised to only store the id field in the DataFrame.
    • A DataFrame was kept (rather than a list) to aid future expansion and flexibility.

    Answer:

    After working through (what I believe to be) your scenario, we have the following setup/requirements:

    • You have one jsonlines file called output.jsonl.
    • This output.jsonl file contains (n) dictionaries; one for each PDF parsed by PyPDF2.
    • We must loop through a directory of 400+ downloaded PDF files and determine whether each PDF's filename is already in output.jsonl.

    If this is correct, let's change tack and take the following approach:

    • Create a list of PDF filenames (called pdfs).
    • Read the id field from the jsonlines file (output.jsonl) into a pandas.DataFrame (called df).
    • Loop through the pdfs list and test whether the filename (id) is in the DataFrame (df).
    • If not, add the filename to a list (called notin).
    • Do as you wish with the notin list to parse these new files into ... whatever you like (a short sketch of one option follows after the output below).
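
    As an aside, the likely reason your original check never matches: filename[:-4] in mytext tests membership against the iterator of whole dictionaries rather than against their 'id' values (and the in test also exhausts the iterator on its first use). A minimal illustration:

    # A str never equals a dict, so membership against the reader's iterator
    # is always False; compare against each row's 'id' field instead.
    rows = iter([{'id': '1234'}, {'id': '1235'}])
    print('1234' in rows)   # False
    print(any(row['id'] == '1234' for row in [{'id': '1234'}, {'id': '1235'}]))   # True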

    My (extended) output.jsonl file looks like this:

    {"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
    {"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
    {"id": "1236", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
    {"id": "1237", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
    {"id": "1238", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
    

    Here's the commented code to accomplish the steps above:

    import os
    import jsonlines
    import pandas as pd
    
    # Set the path to output.jsonl
    path = os.path.expanduser('~/Desktop/output.jsonl')
    # Build a list of PDFs (You'll use `os.listdir()`)
    pdfs = ['1234.pdf', '1235.pdf', '1236.pdf', '1237.pdf', 
            '1238.pdf', '5000.pdf', '5001.pdf']
    # Read output.jsonl and collect each 'id' value.
    ids = []
    with jsonlines.open(path) as reader:
        for line in reader.iter():
            ids.append(line.get('id'))
    # Build a DataFrame holding only the 'id' field.
    # (DataFrame.append was removed in pandas 2.0, so the frame is built once.)
    df = pd.DataFrame({'id': ids})
    # Display the DataFrame's contents.
    print('Contents of the jsonlines file:\n')
    print(df)
    
    # Loop over the PDF filenames and test if each filename is in the DataFrame.
    notin = [i for i in pdfs if os.path.splitext(i)[0] not in df['id'].values]
    # Display the results.
    print('\nThese PDFs are not in your jsonlines file:')
    print(notin)    
    

    The output; note that files 5000.pdf and 5001.pdf were not found:

    Contents of the jsonlines file:
    
         id
    0  1234
    1  1235
    2  1236
    3  1237
    4  1238
    
    These PDFs are not in your jsonlines file:
    ['5000.pdf', '5001.pdf']
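
    From there, one way to "do as you wish" with the notin list (a rough sketch, assuming the PDFs directory from the question is available as file_path_PDFs, a recent PyPDF2 that provides PdfReader/extract_text, and placeholder title/url values):

    from PyPDF2 import PdfReader

    new_records = []
    for pdf in notin:
        # Convert only the PDFs that were not found in output.jsonl.
        reader = PdfReader(os.path.join(file_path_PDFs, pdf))
        text = ''.join(page.extract_text() or '' for page in reader.pages)
        new_records.append({'id': os.path.splitext(pdf)[0],
                            'title': 'Transcript',            # placeholder
                            'url': 'www.stackoverflow.com',   # placeholder
                            'text': text})

    # Append just the new records to output.jsonl.
    with jsonlines.open(path, 'a') as writer:
        writer.write_all(new_records)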