I have code that pulls over 400 PDFs off a website via Beautiful Soup. PyPDF2 converts the PDFs to text, which is then saved as a jsonlines file called 'output.jsonl'.
When I save new PDFs in future updates, I want PyPDF2 to convert only the new PDFs to text and append that new text to the jsonlines file, which is where I am struggling.
The jsonlines file looks like this:
{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}...
The PDFs are named "1234", "1235", etc. and are saved in file_path_PDFs. I am trying to check whether a PDF's "id" already exists as a value in the jsonlines file; if it does, there is no need for PyPDF2 to convert it to text. If it does not exist, process away as usual.
import os
import jsonlines

file_path_PDFs = 'C:/Users/.../PDFs/'
json_list = []

for filename in os.listdir(file_path_PDFs):
    if os.path.exists('C:/Users/.../PDFs/output.jsonl'):
        with jsonlines.open('C:/Users/.../PDFs/output.jsonl') as reader:
            mytext = jsonlines.Reader.iter(reader)
            for obj in mytext:
                if filename[:-4] in mytext:  # filename[:-4] removes .pdf from string
                    continue
                else:
                    ~convert to text~

with jsonlines.open('C:/Users/.../PDFs/output.jsonl', 'a') as writer:
    writer.write_all(json_list)
As it stands, I believe this code is not finding any of the values and is converting ALL of the PDFs each time I run it. Obviously this is quite a lengthy process, with each document spanning 200 or 300 pages.
(Note: only the `id` field is read into the DataFrame below; you could also store other fields, e.g. as a `list`, to aid in future expansion and flexibility.)

After working through (what I believe to be) your scenario, we have the following setup/requirements:

- Parsed PDF text is stored in output.jsonl.
- The output.jsonl file contains (n) dictionaries; one for each PDF parsed by PyPDF2.
- Only PDFs whose id does not already appear in output.jsonl should be converted to text.

If this is correct, let's change tack and take the following approach:

1. Build a list of PDF filenames (called pdfs).
2. Read the id field from the jsonlines file (output.jsonl) into a pandas.DataFrame (called df).
3. Loop over the pdfs list and test whether the filename (id) is in the DataFrame (df).
4. Store any filenames that are missing into a list (called notin).
5. Use the notin list to parse these new files into ... whatever you like (see the sketch at the end of this answer).

My (extended) output.jsonl file looks like this:
{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1236", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1237", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1238", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
Here's the commented code to accomplish the steps above:
import os
import jsonlines
import pandas as pd

# Set the path to output.jsonl
path = os.path.expanduser('~/Desktop/output.jsonl')

# Build a list of PDFs (you'll use `os.listdir()`)
pdfs = ['1234.pdf', '1235.pdf', '1236.pdf', '1237.pdf',
        '1238.pdf', '5000.pdf', '5001.pdf']

# Read output.jsonl and collect each 'id' value.
ids = []
with jsonlines.open(path) as reader:
    for line in reader.iter():
        ids.append(line.get('id'))

# Load the 'id' values into a DataFrame.
df = pd.DataFrame({'id': ids})

# Display the DataFrame's contents.
print('Contents of the jsonlines file:\n')
print(df)

# Loop over the PDF filenames and test if each filename is in the DataFrame.
notin = [i for i in pdfs if os.path.splitext(i)[0] not in df['id'].values]

# Display the results.
print('\nThese PDFs are not in your jsonlines file:')
print(notin)
The output; note that files 5000.pdf and 5001.pdf were not found:
Contents of the jsonlines file:
id
0 1234
1 1235
2 1236
3 1237
4 1238
These PDFs are not in your jsonlines file:
['5000.pdf', '5001.pdf']
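From there, step (5) is whatever your usual PyPDF2/jsonlines workflow looks like. As a rough, untested sketch that reuses the notin list from above: the 'title' and 'url' values are placeholders you'd fill in from your scraping step, and the PyPDF2 calls (PdfReader, page.extract_text()) assume a reasonably recent PyPDF2 version.

import os
import jsonlines
from PyPDF2 import PdfReader

file_path_PDFs = 'C:/Users/.../PDFs/'            # same folder as in the question
path = os.path.expanduser('~/Desktop/output.jsonl')

new_records = []
for pdf in notin:
    # Extract the text of every page with PyPDF2.
    reader = PdfReader(os.path.join(file_path_PDFs, pdf))
    text = '\n'.join(page.extract_text() or '' for page in reader.pages)

    # Build one record per new PDF; 'title' and 'url' are placeholders here.
    new_records.append({'id': os.path.splitext(pdf)[0],
                        'title': 'Transcript',
                        'url': 'www.stackoverflow.com',
                        'text': text})

# Append (mode='a') only the new records to output.jsonl.
with jsonlines.open(path, mode='a') as writer:
    writer.write_all(new_records)

Because the file is opened in append mode, the existing records are left untouched and only the newly converted PDFs are written.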