I'm very new to this and I'm trying to scrape a PDF and create a DataFrame from the information. I can get a list of strings from the scrape with the following code:
import pandas as pd
import tika
from tika import parser

tika.initVM()

file_name = "the_pdf_im_scraping"


def input_file_processing(file_name):
    # Extract the raw text of the PDF with Tika
    parsedPDF = parser.from_file(file_name)
    content = parsedPDF['content']
    # One element per line of text, dropping empty lines
    contentlist = content.split('\n')
    contentlist = list(filter(lambda a: a != '', contentlist))
    return contentlist


contentlist = input_file_processing(file_name)
This creates a list that looks like the following:
["John, Doe", '(803) 470-9419', "Company 1", 'PO Box 23425', 'Columbia, SC 14550', '[email protected]', 'Aaron, Rust. ', 'Fax (864) 751-5784', '1317 Waterfall, #2334', 'Orlando, FL 32804', '[email protected]', 'Betn, S. Raul', 'Fax (864) 666-4484', 'S. Raul Baron, P.A.', '1456 Edgewater Drive, #2034', 'Orlando, FL 32804', '[email protected]', 'Abueno, Daniel George', '(444) 123-2633', '3456 Robert Drive', '#2316', 'Charleston, SC 29492', '[email protected]']
And now I'm stuck trying to create a DataFrame with the following columns:
last_name,first_name,company,phone_number,fax_number,address,city,state,zip_code,email
I don't know if it's very complicated or not, but the main problem seems to be that some people have a company name while others don't, and the same goes for the fax number and other elements.
Also, some addresses break across 2 lines, i.e. 2 elements of the list.
Beyond the fact that not every kind of information is always present, the fields are also not always given in the same order, so it is hard to know for sure what's what (company name or address, for instance).
But, with the toy data you provided:
results = [
    "John, Doe",
    "(803) 470-9419",
    "Company 1",
    "PO Box 23425",
    "Columbia, SC 14550",
    "[email protected]",
    "Aaron, Rust. ",
    "Fax (864) 751-5784",
    "1317 Waterfall, #2334",
    "Orlando, FL 32804",
    "[email protected]",
    "Betn, S. Raul",
    "Fax (864) 666-4484",
    "S. Raul Baron, P.A.",
    "1456 Edgewater Drive, #2034",
    "Orlando, FL 32804",
    "[email protected]",
    "Abueno, Daniel George",
    "(444) 123-2633",
    "3456 Robert Drive",
    "#2316",
    "Charleston, SC 29492",
    "[email protected]",
]
Here is one way to get most of what you want as a dataframe:
import pandas as pd

# Find index of email addresses
idx = [i for i, item in enumerate(results) if "@" in item]

# Split data on email addresses (each person's block ends with an email)
data = []
i = 0
for j in idx:
    data.append(results[i : j + 1])
    i = j + 1

# Iterate on each list in data and populate a dictionary
clean_data = []
for person in data:
    infos = {"misc": []}
    infos["identity"] = person[0]
    # The line before the email looks like "Columbia, SC 14550"
    city, state_zip = person[-2].split(", ")
    infos["city"] = city
    infos["state"], infos["zip_code"] = state_zip.split(" ")
    for item in person[1:-2] + [person[-1]]:
        if "@" in item:
            infos["email"] = item
        elif item.startswith("Fax"):
            infos["fax_number"] = item
        elif (
            item.replace("(", "")
            .replace(")", "")
            .replace(" ", "")
            .replace("-", "")
            .isnumeric()
        ):
            infos["phone_number"] = item
        else:
            # Company names and address lines cannot be told apart reliably
            infos["misc"].append(item)
    clean_data.append(infos)

# Convert list of dictionaries to dataframe and reorder columns
df = pd.DataFrame(clean_data).reindex(
    columns=[
        "identity",
        "phone_number",
        "fax_number",
        "city",
        "state",
        "zip_code",
        "email",
        "misc",
    ]
)
Then:
print(df)
# Output
                identity    phone_number          fax_number        city
0              John, Doe  (803) 470-9419                 NaN    Columbia  \
1          Aaron, Rust.              NaN  Fax (864) 751-5784     Orlando
2          Betn, S. Raul             NaN  Fax (864) 666-4484     Orlando
3  Abueno, Daniel George  (444) 123-2633                 NaN  Charleston

   state  zip_code              email
0     SC     14550  [email protected]  \
1     FL     32804  [email protected]
2     FL     32804  [email protected]
3     SC     29492  [email protected]

                                                 misc
0                           [Company 1, PO Box 23425]
1                             [1317 Waterfall, #2334]
2  [S. Raul Baron, P.A., 1456 Edgewater Drive, #2034]
3                          [3456 Robert Drive, #2316]
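To get closer to the columns asked for in the question, the "identity" and "misc" columns can be split further. The following is only a sketch under stated assumptions: it assumes names are always written "Last, First", and it guesses that a leftover line is part of the address when it starts with a digit, "#", or "PO Box", treating anything else as the company name. The names `sample`, `looks_like_address`, `split_misc` and `tidy` are made up for this illustration, and the rule will misclassify real data that does not follow those patterns.

```python
import pandas as pd

# Hypothetical sample mirroring the "identity" and "misc" columns of df.
sample = pd.DataFrame(
    {
        "identity": ["John, Doe", "Betn, S. Raul", "Abueno, Daniel George"],
        "misc": [
            ["Company 1", "PO Box 23425"],
            ["S. Raul Baron, P.A.", "1456 Edgewater Drive, #2034"],
            ["3456 Robert Drive", "#2316"],
        ],
    }
)


def looks_like_address(line):
    """Assumed heuristic: address lines start with a digit, '#' or 'PO Box'."""
    line = line.strip()
    return line[:1].isdigit() or line.startswith(("#", "PO Box"))


def split_misc(lines):
    """Return a (company, address) tuple from one person's leftover lines."""
    company = [line for line in lines if not looks_like_address(line)]
    address = [line for line in lines if looks_like_address(line)]
    # Empty pieces become None so they show up as missing values.
    return (" ".join(company) or None, ", ".join(address) or None)


# "Last, First" -> two columns, splitting only on the first comma.
names = sample["identity"].str.split(",", n=1, expand=True)
sample["last_name"] = names[0].str.strip()
sample["first_name"] = names[1].str.strip()

# Classify the leftover lines into company and address.
parts = sample["misc"].map(split_misc)
sample["company"] = parts.str[0]
sample["address"] = parts.str[1]

tidy = sample.drop(columns=["identity", "misc"])
```

The same steps can be applied to `df` itself instead of `sample`. Address lines that were split across two elements are joined back with a comma, and people without a company end up with a missing value in that column.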