python pandas csv pdf pdfplumber

How to Convert PDF file into CSV file using Python Pandas


I have a PDF file that I need to convert into a CSV file. An example of the PDF is available at https://online.flippingbook.com/view/352975479/. The code I used is:

import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple
file = "Battery Voltage.pdf"
lines = []
total_check = 0

with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)

With the above script I am not getting the proper output: for the Time column, "AM" ends up on the next line. The output I am getting looks like this:

[1]: https://i.sstatic.net/25Yxc.png


Solution

  • For cases like these, build a parser that converts the unusable data into something you can use.

    The logic below converts that exact file to a CSV, but it will only work with that specific file's contents.

    Note that for this specific file you can ignore the AM/PM as the time is in 24h format.

    import pdfplumber
    
    
    file = "Battery Voltage.pdf"
    skiplines = [
        "Battery Voltage",
        "AM",
        "PM",
        "Sr No DateTIme Voltage (v) Ignition",
        ""
    ]
    
    
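    with open("output.csv", "w") as outfile:
        # Header for the five semicolon-delimited output columns.
        header = "serialnumber;date;time;voltage;ignition\n"
        outfile.write(header)
        with pdfplumber.open(file) as pdf:
            for page in pdf.pages:
                # Guard against pages with no extractable text.
                for line in (page.extract_text() or "").split('\n'):
                    # Drop the title, column header, stray AM/PM lines and blanks.
                    if line.strip() in skiplines:
                        continue
                    # Whitespace-separated fields become ';'-separated columns.
                    outfile.write(";".join(line.split())+"\n")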
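
The filtering step above can also be separated from the PDF reading, which makes it easier to try out on plain text. What follows is only a minimal sketch under stated assumptions: `rows_from_text`, `write_csv` and `SAMPLE_TEXT` are hypothetical names, the sample lines are invented for illustration and are not taken from the real PDF, and the standard `csv` module is swapped in for the manual `";".join(...)`.

```python
import csv
import io

# Same skip list as in the answer; these strings are specific to that PDF.
SKIPLINES = {
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    "",
}


def rows_from_text(text):
    """Yield one list of fields per data line, dropping the skip lines."""
    for line in text.split("\n"):
        if line.strip() in SKIPLINES:
            continue
        yield line.split()


def write_csv(text, outfile):
    """Write a header plus the parsed rows to an open text file, ';'-delimited."""
    writer = csv.writer(outfile, delimiter=";", lineterminator="\n")
    writer.writerow(["serialnumber", "date", "time", "voltage", "ignition"])
    writer.writerows(rows_from_text(text))


# Hypothetical sample text, made up for illustration only.
SAMPLE_TEXT = (
    "Battery Voltage\n"
    "Sr No DateTIme Voltage (v) Ignition\n"
    "1 01/02/2021 13:05:00 12.4 ON\n"
    "PM\n"
    "2 01/02/2021 13:10:00 12.3 OFF\n"
    "PM\n"
)

buffer = io.StringIO()
write_csv(SAMPLE_TEXT, buffer)
csv_output = buffer.getvalue()
print(csv_output)
```

With the real file, the string returned by `page.extract_text()` for each page would take the place of `SAMPLE_TEXT`, and an opened `output.csv` would take the place of the in-memory buffer.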
    

    EDIT

    So, JSON files in Python are basically just a list of dict items (yes, that's an oversimplification).

    To write JSON instead of CSV, the only thing you need to change is how you process each line. The core of the logic doesn't change:

    import pdfplumber
    import json
    
    
    file = "Battery Voltage.pdf"
    skiplines = [
        "Battery Voltage",
        "AM",
        "PM",
        "Sr No DateTIme Voltage (v) Ignition",
        ""
    ]
    result = []
    
    
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
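            # Guard against pages with no extractable text.
            for line in (page.extract_text() or "").split("\n"):
                if line.strip() in skiplines:
                    continue
                # Each remaining line is expected to hold exactly five fields.
                serialnumber, date, time, voltage, ignition = line.split()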
                result.append(
                    {
                        "serialnumber": serialnumber,
                        "date": date,
                        "time": time,
                        "voltage": voltage,
                        "ignition": ignition,
                    }
                )
    
    with open("output.json", "w") as outfile:
        json.dump(result, outfile)
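
The five-way unpacking above raises a `ValueError` as soon as a line does not split into exactly five fields. As a hedged sketch (not part of the original answer), the check can be made explicit so that unexpected lines are skipped instead. `parse_line`, `FIELDS` and the sample lines are hypothetical and invented for illustration.

```python
import json

FIELDS = ("serialnumber", "date", "time", "voltage", "ignition")


def parse_line(line):
    """Return a dict for a five-field line, or None for anything else."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


# Hypothetical lines, made up for illustration only.
sample_lines = [
    "1 01/02/2021 13:05:00 12.4 ON",
    "PM",
    "Page 1 of 3",
    "2 01/02/2021 13:10:00 12.3 OFF",
]

# Keep only the lines that parsed into a complete record.
records = [rec for rec in map(parse_line, sample_lines) if rec is not None]
json_text = json.dumps(records, indent=2)
print(json_text)
```

This only checks the number of fields, so a non-data line that happens to contain five words would still get through. Keeping the skip list alongside it remains the safer choice.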