pandas dataframe pdf web-scraping series

How to parse info in a PDF and make a dataframe?

I'm very new to this and I'm trying to scrape a PDF and create a DataFrame from the information. I can get a list of strings from the PDF with the following code:

import pandas as pd
import tika
from tika import parser
tika.initVM()

file_name = "the_pdf_im_scraping"

def input_file_processing(file_name):
    parsedPDF = parser.from_file(file_name)
    content = parsedPDF['content']
    contentlist = content.split('\n')
    contentlist = list(filter(lambda a: a != '', contentlist))
    return contentlist

contentlist = input_file_processing(file_name)

This creates a list that looks like the following:

["John, Doe", '(803) 470-9419', "Company 1", 'PO Box 23425', 'Columbia, SC 14550', '[email protected]', 'Aaron, Rust. ', 'Fax (864) 751-5784', '1317 Waterfall, #2334', 'Orlando, FL 32804', '[email protected]', 'Betn, S. Raul', 'Fax (864) 666-4484', 'S. Raul Baron, P.A.', '1456 Edgewater Drive, #2034', 'Orlando, FL 32804', '[email protected]', 'Abueno, Daniel George', '(444) 123-2633', '3456 Robert Drive', '#2316', 'Charleston, SC 29492', '[email protected]']

And now I'm stuck trying to create a DataFrame with the following columns:

last_name,first_name,company,phone_number,fax_number,address,city,state,zip_code,email

I don't know if it's very complicated or not, but the main problem seems to be that some people have a company name while others don't, and the same goes for fax numbers and other elements.

Also, some addresses are broken across two lines, that is, two elements of the list.

Solution

Besides the fact that some fields are missing for some people, the fields are not always presented in the same order, so it is hard to know for sure which item is which (a company name or an address line, for instance).

But, with the toy data you provided:

results = [
    "John, Doe",
    "(803) 470-9419",
    "Company 1",
    "PO Box 23425",
    "Columbia, SC 14550",
    "[email protected]",
    "Aaron, Rust. ",
    "Fax (864) 751-5784",
    "1317 Waterfall, #2334",
    "Orlando, FL 32804",
    "[email protected]",
    "Betn, S. Raul",
    "Fax (864) 666-4484",
    "S. Raul Baron, P.A.",
    "1456 Edgewater Drive, #2034",
    "Orlando, FL 32804",
    "[email protected]",
    "Abueno, Daniel George",
    "(444) 123-2633",
    "3456 Robert Drive",
    "#2316",
    "Charleston, SC 29492",
    "[email protected]",
]

Here is one way to get most of what you want as a dataframe:

import pandas as pd

# Find index of email addresses
idx = [i for i, item in enumerate(results) if "@" in item]

# Split data on email addresses
data = []
i = 0
for j in idx:
    data.append(results[i : j + 1])
    i = j + 1

# Iterate on each list in data and populate a dictionary
clean_data = []
for person in data:
    infos = {"misc": []}
    infos["identity"] = person[0]
    # The next-to-last item looks like "Columbia, SC 14550"
    city, state_zip = person[-2].split(", ")
    infos["city"] = city
    infos["state"], infos["zip_code"] = state_zip.split(" ")
    for item in person[1:-2] + [person[-1]]:
        if "@" in item:
            infos["email"] = item
        elif item.startswith("Fax"):
            infos["fax_number"] = item
        elif (
            item.replace("(", "")
            .replace(")", "")
            .replace(" ", "")
            .replace("-", "")
            .isnumeric()
        ):
            infos["phone_number"] = item
        else:
            infos["misc"].append(item)
    clean_data.append(infos)

# Convert list of dictionaries to dataframe and reorder columns
df = pd.DataFrame(clean_data).reindex(
    columns=[
        "identity",
        "phone_number",
        "fax_number",
        "city",
        "state",
        "zip_code",
        "email",
        "misc",
    ]
)

Then:

print(df)
# Output

                identity    phone_number          fax_number        city   
0              John, Doe  (803) 470-9419                 NaN    Columbia  \
1          Aaron, Rust.              NaN  Fax (864) 751-5784     Orlando   
2          Betn, S. Raul             NaN  Fax (864) 666-4484     Orlando   
3  Abueno, Daniel George  (444) 123-2633                 NaN  Charleston   

   state  zip_code                 email   
0     SC     14550       [email protected]  \
1     FL     32804       [email protected]   
2     FL     32804  [email protected]   
3     SC     29492   [email protected]   

                                                 misc  
0                           [Company 1, PO Box 23425]  
1                             [1317 Waterfall, #2334]  
2  [S. Raul Baron, P.A., 1456 Edgewater Drive, #2034]  
3                          [3456 Robert Drive, #2316]
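To get closer to the columns you asked for, the identity and misc columns can be split further. The following is only a rough sketch on three rows shaped like the output above. The helpers looks_like_address and split_misc are hypothetical names, and the rule they apply (an address line starts with a digit, "#" or "PO Box") is an assumption that fits the toy data but will probably need tuning on the real PDF:

```python
import pandas as pd

# Three rows shaped like the "identity" and "misc" columns above
df = pd.DataFrame(
    {
        "identity": ["John, Doe", "Betn, S. Raul", "Abueno, Daniel George"],
        "misc": [
            ["Company 1", "PO Box 23425"],
            ["S. Raul Baron, P.A.", "1456 Edgewater Drive, #2034"],
            ["3456 Robert Drive", "#2316"],
        ],
    }
)

# Split "Last, First" on the first comma only
names = df["identity"].str.split(",", n=1, expand=True)
df["last_name"] = names[0].str.strip()
df["first_name"] = names[1].str.strip()


def looks_like_address(item):
    """Assumed heuristic: address lines start with a digit, '#' or 'PO Box'."""
    return item[:1].isdigit() or item.startswith(("#", "PO Box"))


def split_misc(items):
    """Return (company, address) built from the leftover strings."""
    company = [item for item in items if not looks_like_address(item)]
    address = [item for item in items if looks_like_address(item)]
    # An empty join gives "", which is replaced by None (missing value)
    return " ".join(company) or None, " ".join(address) or None


parts = [split_misc(items) for items in df["misc"]]
df["company"] = [company for company, _ in parts]
df["address"] = [address for _, address in parts]
df = df.drop(columns=["identity", "misc"])
```

Anything the heuristic does not recognize as an address ends up in company, so it is worth checking that column by eye on the real data.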