I have code that pulls over 400 PDFs off a website via Beautiful Soup. PyPDF2 converts the PDFs to text, which is then saved as a jsonlines file called 'output.jsonl'.
When I save new PDFs in future updates, I want PyPDF2 to convert only the new PDFs to text and append that new text to the jsonlines file, which is where I am struggling.
The jsonlines file looks like this:
{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}...
The PDFs are named "1234", "1235", etc. and are saved in file_path_PDFs. I am trying to check whether a PDF's "id" already exists as a value in the jsonlines file; if it does, there is no need for PyPDF2 to convert it to text. If it does not exist, process away as usual.
import os
import jsonlines

file_path_PDFs = 'C:/Users/.../PDFs/'
json_list = []

for filename in os.listdir(file_path_PDFs):
    if os.path.exists('C:/Users/.../PDFs/output.jsonl'):
        with jsonlines.open('C:/Users/.../PDFs/output.jsonl') as reader:
            mytext = jsonlines.Reader.iter(reader)
            for obj in mytext:
                if filename[:-4] in mytext:  # filename[:-4] removes .pdf from string
                    continue
                else:
                    ~convert to text~

with jsonlines.open('C:/Users/.../PDFs/output.jsonl', 'a') as writer:
    writer.write_all(json_list)
As it stands, I believe this code is not finding any of the values and is converting ALL of the PDFs each time I run it. Obviously this is quite a lengthy process, with each document spanning 200 or 300 pages.
(Note: only the `id` field is read into the DataFrame below; you could also store other fields, e.g. as a `list`, to aid in future expansion and flexibility.)

After working through (what I believe to be) your scenario, we have the following setup/requirements:

- Parsed PDF text is stored in output.jsonl.
- The output.jsonl file contains (n) dictionaries; one for each PDF parsed by PyPDF2.
- Only PDFs whose id does not already appear in output.jsonl should be converted to text.

If this is correct, let's change tack and take the following approach:

1. Build a list of PDF filenames (called pdfs).
2. Read the id field from the jsonlines file (output.jsonl) into a pandas.DataFrame (called df).
3. Loop over the pdfs list and test whether the filename (id) is in the DataFrame (df).
4. Store any filenames that are missing into a list (called notin).
5. Use the notin list to parse these new files into ... whatever you like (see the sketch at the end of this answer).

My (extended) output.jsonl file looks like this:
{"id": "1234", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1235", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1236", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1237", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
{"id": "1238", "title": "Transcript", "url": "www.stackoverflow.com", "text": "200 pages worth of text"}
Here's the commented code to accomplish the steps above:
import os
import jsonlines
import pandas as pd

# Set the path to output.jsonl
path = os.path.expanduser('~/Desktop/output.jsonl')

# Build a list of PDFs (you'll use `os.listdir()`)
pdfs = ['1234.pdf', '1235.pdf', '1236.pdf', '1237.pdf',
        '1238.pdf', '5000.pdf', '5001.pdf']

# Read output.jsonl and collect each 'id' value.
ids = []
with jsonlines.open(path) as reader:
    for line in reader.iter():
        ids.append(line.get('id'))

# Load the 'id' values into a DataFrame.
df = pd.DataFrame({'id': ids})

# Display the DataFrame's contents.
print('Contents of the jsonlines file:\n')
print(df)

# Loop over the PDF filenames and test if each filename is in the DataFrame.
notin = [i for i in pdfs if os.path.splitext(i)[0] not in df['id'].values]

# Display the results.
print('\nThese PDFs are not in your jsonlines file:')
print(notin)
The output; note that files 5000.pdf and 5001.pdf were not found:
Contents of the jsonlines file:
id
0 1234
1 1235
2 1236
3 1237
4 1238
These PDFs are not in your jsonlines file:
['5000.pdf', '5001.pdf']
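From there, step (5) is whatever your usual PyPDF2/jsonlines workflow looks like. As a rough, untested sketch that reuses the notin list from above: the 'title' and 'url' values are placeholders you'd fill in from your scraping step, and the PyPDF2 calls (PdfReader, page.extract_text()) assume a reasonably recent PyPDF2 version.

import os
import jsonlines
from PyPDF2 import PdfReader

file_path_PDFs = 'C:/Users/.../PDFs/'            # same folder as in the question
path = os.path.expanduser('~/Desktop/output.jsonl')

new_records = []
for pdf in notin:
    # Extract the text of every page with PyPDF2.
    reader = PdfReader(os.path.join(file_path_PDFs, pdf))
    text = '\n'.join(page.extract_text() or '' for page in reader.pages)

    # Build one record per new PDF; 'title' and 'url' are placeholders here.
    new_records.append({'id': os.path.splitext(pdf)[0],
                        'title': 'Transcript',
                        'url': 'www.stackoverflow.com',
                        'text': text})

# Append (mode='a') only the new records to output.jsonl.
with jsonlines.open(path, mode='a') as writer:
    writer.write_all(new_records)

Because the file is opened in append mode, the existing records are left untouched and only the newly converted PDFs are written.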