How to parse info in a PDF and make a dataframe?


I'm very new to this and I'm trying to scrape a PDF and create a DataFrame from its contents. I can get a list of strings from the scrape with the following code:

import pandas as pd
import tika
from tika import parser
tika.initVM()

file_name = "the_pdf_im_scraping"

def input_file_processing(file_name):
    parsedPDF = parser.from_file(file_name)
    content = parsedPDF['content']
    contentlist = content.split('\n')
    contentlist = list(filter(lambda a: a != '', contentlist))
    return contentlist

contentlist = input_file_processing(file_name)

This creates a list that looks like the following:

["John, Doe", '(803) 470-9419', "Company 1", 'PO Box 23425', 'Columbia, SC 14550', '[email protected]', 'Aaron, Rust. ', 'Fax (864) 751-5784', '1317 Waterfall, #2334', 'Orlando, FL 32804', '[email protected]', 'Betn, S. Raul', 'Fax (864) 666-4484', 'S. Raul Baron, P.A.', '1456 Edgewater Drive, #2034', 'Orlando, FL 32804', '[email protected]', 'Abueno, Daniel George', '(444) 123-2633', '3456 Robert Drive', '#2316', 'Charleston, SC 29492', '[email protected]']

Now I'm stuck trying to create a DataFrame with the following columns:

last_name,first_name,company,phone_number,fax_number,address,city,state,zip_code,email

I don't know if it's very complicated, but the main problem seems to be that some entries have a company name while others don't, and the same goes for the fax number and other elements.

Also, some addresses break across two lines, that is, two elements of the list.


Solution

  • Not every record contains every kind of information, and the fields are not always presented in the same order, so it is hard to know for sure what's what (company name or address, for instance).

    But, with the toy data you provided:

    results = [
        "John, Doe",
        "(803) 470-9419",
        "Company 1",
        "PO Box 23425",
        "Columbia, SC 14550",
        "[email protected]",
        "Aaron, Rust. ",
        "Fax (864) 751-5784",
        "1317 Waterfall, #2334",
        "Orlando, FL 32804",
        "[email protected]",
        "Betn, S. Raul",
        "Fax (864) 666-4484",
        "S. Raul Baron, P.A.",
        "1456 Edgewater Drive, #2034",
        "Orlando, FL 32804",
        "[email protected]",
        "Abueno, Daniel George",
        "(444) 123-2633",
        "3456 Robert Drive",
        "#2316",
        "Charleston, SC 29492",
        "[email protected]",
    ]
    

    Here is one way to get most of what you want as a dataframe:

    import pandas as pd
    
    # Find index of email addresses
    idx = [i for i, item in enumerate(results) if "@" in item]
    
    # Split data on email addresses
    data = []
    i = 0
    for j in idx:
        data.append(results[i : j + 1])
        i = j + 1
    
    # Iterate on each list in data and populate a dictionary
    clean_data = []
    for person in data:
        infos = {"misc": []}
        infos["identity"] = person[0]
        # The line just before the email reads "City, ST 12345"
        city, state_zip = person[-2].rsplit(", ", 1)
        infos["city"] = city
        infos["state"], infos["zip_code"] = state_zip.split()
        for item in person[1:-2] + [person[-1]]:
            if "@" in item:
                infos["email"] = item
            elif item.startswith("Fax"):
                infos["fax_number"] = item
            elif (
                item.replace("(", "")
                .replace(")", "")
                .replace(" ", "")
                .replace("-", "")
                .isnumeric()
            ):
                infos["phone_number"] = item
            else:
                infos["misc"].append(item)
        clean_data.append(infos)
    
    # Convert list of dictionaries to dataframe and reorder columns
    df = pd.DataFrame(clean_data).reindex(
        columns=[
            "identity",
            "phone_number",
            "fax_number",
            "city",
            "state",
            "zip_code",
            "email",
            "misc",
        ]
    )
    

    Then:

    print(df)
    # Output
    
                    identity    phone_number          fax_number        city   
    0              John, Doe  (803) 470-9419                 NaN    Columbia  \
    1          Aaron, Rust.              NaN  Fax (864) 751-5784     Orlando   
    2          Betn, S. Raul             NaN  Fax (864) 666-4484     Orlando   
    3  Abueno, Daniel George  (444) 123-2633                 NaN  Charleston   
    
       state  zip_code                 email   
    0     SC     14550       [email protected]  \
    1     FL     32804       [email protected]   
    2     FL     32804  [email protected]   
    3     SC     29492   [email protected]   
    
                                                     misc  
    0                           [Company 1, PO Box 23425]  
    1                             [1317 Waterfall, #2334]  
    2  [S. Raul Baron, P.A., 1456 Edgewater Drive, #2034]  
    3                          [3456 Robert Drive, #2316]
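
    To get closer to the columns you listed, the name and the leftover "misc" items still need to be split. The sketch below is only a guess based on the toy data: the helper names (looks_like_address, finish_record) are made up for illustration, and it assumes that a leftover line starting with a digit, "#", or "PO Box" is part of the address while anything else is a company name. A company such as "3M" would be misclassified, so check the rule against your real PDF.

```python
import re

# Assumption: leftover lines that start with a digit, "#", or "PO Box"
# belong to the street address; everything else is treated as a company name.
def looks_like_address(item):
    return item.startswith(("PO Box", "#")) or item[:1].isdigit()

def finish_record(infos):
    """Return a copy of one record with the remaining target columns filled in."""
    out = dict(infos)
    # "Last, First" -> two fields, splitting on the first comma only
    last_name, _, first_name = out.pop("identity").partition(",")
    out["last_name"] = last_name.strip()
    out["first_name"] = first_name.strip()
    # Drop the "Fax" label so only the number is stored
    fax = out.get("fax_number")
    if isinstance(fax, str):
        out["fax_number"] = re.sub(r"^Fax\s*", "", fax)
    # Sort the leftover lines into address parts and company parts
    misc = out.pop("misc", [])
    out["address"] = ", ".join(m for m in misc if looks_like_address(m)) or None
    out["company"] = ", ".join(m for m in misc if not looks_like_address(m)) or None
    return out

# A few records shaped like the dictionaries built in the loop above
records = [
    {"identity": "John, Doe", "misc": ["Company 1", "PO Box 23425"]},
    {"identity": "Betn, S. Raul", "fax_number": "Fax (864) 666-4484",
     "misc": ["S. Raul Baron, P.A.", "1456 Edgewater Drive, #2034"]},
    {"identity": "Abueno, Daniel George", "misc": ["3456 Robert Drive", "#2316"]},
]
finished = [finish_record(r) for r in records]
print(finished[1]["company"], "|", finished[1]["address"])
```

    Passing the resulting list of dictionaries to pd.DataFrame, as done with clean_data above, should give you last_name, first_name, company and address columns. Records with no company line end up with None in that column.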