Search code examples
pythonpandaspdftabula-py

Extracting tables from PDF using tabula-py fails to properly detect rows


Problem

I want to extract a 70-page vocabulary table from a PDF and turn it into a CSV to use in [any vocabulary learning app]. Tabula-py and its read_pdf function is a popular solution to extract the tables, and it did detect the columns ideally without any fine-tuning. But, it only detected the columns well and had difficulties with the multi-line rows, splitting each line into a different row.

E.g., in the PDF you will have columns 2 and 3. The table on Stackoverflow doesn't seem to allow multi-line content either, so I added row numbers. Just merge the row 1 in your head.

Row number German Latin
1 First word Translation for first word
1 with many lines of content [phonetic vocabulary thingy]
1 and more lines
2 Second word Translation for second word

Instead of fine-tuning the read_pdf parameters, are there ways around that?


Solution

  • Possible solution

    Instead of experimenting with tabula-py, which is perfectly legit of course, you can export a pdf in Adobe Reader using File->Export a PDF->HTML Web Page

    You then read it using

    import pandas as pd
    
    dfs = pd.read_html("file.html", header=0,encoding='utf-8')
    

    to get a list of pandas dataframes. You could also use BeautifulSoup4 or similar solutions to extract the tables.

    To match tables with the same column names (e.g., in a vocabulary table) and save them as csv, you can do this:

    
    from collections import defaultdict
    unique_columns_to_dataframes = defaultdict(list)
    
    # We need to get a hashable key for the dictionary, so we join the df.columns.values. Strings can be hashed.
    possible_column_variations = [("%%".join(list(df.columns.values)), i) for i, df in enumerate(dfs)]
    
    for k, v in possible_column_variations:
        unique_columns_to_dataframes[k].append(v)
        
    for k, v in unique_columns_to_dataframes.items():
        new_df = pd.concat([dfs[i] for i in v])
        new_df.reset_index(drop=True,inplace=True)
        # Save file with a unique name. Unique name is a hash out from the characters in the column_names, not collision-free but unlikely to collide for small number of tables
        new_df.to_csv("Df_"+str(sum([ord(c) for c in k]))+".csv", index=False, sep=";", encoding='utf-8')