python pandas csv pdfminer tabula-py

PDF to CSV - converted CSV has interchanged column contents


I am trying to convert a PDF file to CSV using Python and have written the code below. It used to work, but recently the column contents in the converted CSV file have come out interchanged.

How can I fix this column issue in my code?

#!/usr/bin/env python3
import tabula
import pandas as pd
import csv

pdf_file='/pdf2xls/Input.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
              'Net Wt.kg','Blender','Remarks','Operator']

# Page 1 processing
# area is (top, left, bottom, right); columns are the x-coordinates of the column boundaries
df1 = tabula.read_pdf(
    pdf_file,
    pages=1,
    area=(95, 20, 800, 840),
    columns=[93, 180, 220, 252, 310, 315, 333, 367, 410, 450, 480, 520],
    pandas_options={'header': None},
)

df1[0]=df1[0].drop(columns=5)
df1[0].columns=column_names
#df1[0].head(2)

#df1[0].to_csv('result.csv')

result = pd.DataFrame(df1[0]) # concatenate both the pages and then write to CSV
result.to_csv("/pdf2xls/Input.csv")

Solution

  • You can use pdfplumber to extract the tables instead:

    # pip install pdfplumber
    import pandas as pd
    import pdfplumber
    
    # pdf_file is the path defined in the question
    pdf = pdfplumber.open(pdf_file)
    tables = pdf.pages[0].extract_tables()
    
    (
        pd.DataFrame(
            # get the second table and skip the last three rows
            data=tables[1][:-3],
            # get the last row of the first table
            columns=tables[0][-1]
        )
        .replace("", float("nan")) # get rid of the empty strings
        # .to_csv("out.csv", index=False) # uncomment to make a fresh csv
    )
    

    Output:

       Product    Batch No Machin\nNo   Time        Date Drum/\nBag\nNo Tare\nWt.kg Gross\nWt.kg Net\nWt.kg  Blender Operator
    0    L1050  23JJ0AL051     WB-102  01:07  16-10-2023              1       57.20      1398.80    1341.60      NaN     Amit
    1    L1050  23JJ0AL051     WB-102  01:22  16-10-2023              2       57.40      1398.80    1341.40      NaN     Amit
    2    L1050  23JJ0AL051     WB-102  01:33  16-10-2023              3       58.20      1399.60    1341.40      NaN     Amit
    3    L1050  23JJ0AL051     WB-102  01:44  16-10-2023              4       58.80      1400.60    1341.80      NaN     Amit
    4    L1050  23JJ0AL051     WB-102  01:55  16-10-2023              5       57.20      1399.00    1341.80      NaN     Amit
    ..     ...         ...        ...    ...         ...            ...         ...          ...        ...      ...      ...
    20   L1050  23JJ0AL051     WB-102  05:42  16-10-2023             21       57.40      1398.60    1341.20      NaN     Amit
    21   L1050  23JJ0AL051     WB-102  05:52  16-10-2023             22       57.40      1399.00    1341.60      NaN     Amit
    22   L1050  23JJ0AL051     WB-102  06:00  16-10-2023             23       57.40      1398.80    1341.40      NaN     Amit
    23   L1050  23JJ0AL051     WB-102  06:10  16-10-2023             24       57.80      1399.60    1341.80      NaN     Amit
    24   L1050  23JJ0AL051     WB-102  06:19  16-10-2023             25       57.80      1399.40    1341.60      NaN     Amit
    
    [25 rows x 11 columns]
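
The header cells in the output above still contain line breaks (for example `Machin\nNo` and `Drum/\nBag\nNo`), which would end up inside the CSV header row. A minimal sketch of one way to flatten them into the single-line names used in the question is shown below. The helper name `clean_header` is hypothetical, not part of pandas or pdfplumber, and the rule of joining text after a `/` without a space is an assumption based on the `Drum/Bag No` label in the question.

```python
import re


def clean_header(name):
    """Collapse line breaks inside one header cell into a single-line label."""
    if name is None:  # guard for cells that come back empty from the extractor
        return ""
    # Assumption: a break right after "/" is a wrapped word, so join without a space
    text = re.sub(r"/\s*\n\s*", "/", str(name))
    # Any remaining line break becomes a single space
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


# Header row copied from the output above
raw_headers = [
    "Product", "Batch No", "Machin\nNo", "Time", "Date", "Drum/\nBag\nNo",
    "Tare\nWt.kg", "Gross\nWt.kg", "Net\nWt.kg", "Blender", "Operator",
]

cleaned = [clean_header(h) for h in raw_headers]
print(cleaned)
```

With a DataFrame built as in the answer, the same helper could be applied through `df.columns = [clean_header(c) for c in df.columns]` before calling `to_csv`.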