Search code examples
pythonregexcsvpdfplumber

pdfplumber and regex - but regex not extracting data and writing to CSV


Using the following code I am able to obtain the list of transactions in the desired format

import re
import pdfplumber
import csv


line_re = re.compile(r"(\d{2}/\d{2}/\d{4}\s+\d{2}/\d{2}/\d{4}.+)$")


transactions = []

with pdfplumber.open('./Bank Acct statement.pdf') as pdf:
    for page in pdf.pages:
        
        text = page.extract_text()

        
        lines = text.split('\n')

        for line in lines:
            if re.match(line_re, line):
                transactions.append(line)

The above returns the following list of transactions:

26/01/2023 26/01/2023 Payment Received Z Kona 8 000.00 8 085.87
26/01/2023 26/01/2023 Banking App Payment: Sihle -2 000.00 6 085.87
26/01/2023 26/01/2023 Payment Fee -1.50 6 084.37
26/01/2023 26/01/2023 SMS Payment Notification Fee -0.25 6 084.12
26/01/2023 26/01/2023 Payment Received Z Kona 15 000.00 21 084.12
26/01/2023 26/01/2023 Payment Received Z Kona 1 500.00 22 584.12
26/01/2023 26/01/2023 Payment Received Z Kona 2 000.00 24 584.12
26/01/2023 26/01/2023 Banking App Transfer to Ms K Savings (1816578655) -18 500.00 6 084.12

However when trying to extract the components and writing to csv, the csv file is empty. This is the modified code where I try and do the entire process:

import re
import pdfplumber
import csv


line_re = re.compile(r"(\d{2}/\d{2}/\d{4}\s+\d{2}/\d{2}/\d{4}.+)$")


transactions = []

with pdfplumber.open('./Bank Acct statement.pdf') as pdf:
    for page in pdf.pages:
       
        text = page.extract_text()

       
        lines = text.split('\n')

        for line in lines:
            if re.match(line_re, line):
                
                transactions.append(line)

# Define headers for the CSV file
csv_headers = [
    "Posting Date",
    "Transaction Date",
    "Description",
    "Amount",
    "Balance",
]

# Specify the CSV file path
csv_file_path = "transactions.csv"

# Write transactions to the CSV file
with open(csv_file_path, mode="w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    
    # Write the headers as the first row
    csv_writer.writerow(csv_headers)
    
    # Write each transaction as a row in the CSV file
    for transaction in transactions:
        # Split the transaction line into its components
        match = re.search(
            r"(\d{2}/\d{2}/\d{4})\s+(\d{2}/\d{2}/\d{4})\s+([\w\s\(\),]+)\s+(-?\d{1,10}\s\d{1,2}\.\d{2})\s+(-?\d{1,10}\s\d{1,2}\.\d{2})",
            transaction,
        )
        if match:
            csv_writer.writerow([match.group(1), match.group(2), match.group(3), match.group(4), match.group(5)])

print(f"Transactions saved to {csv_file_path}")

The CSV only returns the headers, and all text in the first column. What am I doing wrong? I am not the best at regex.


Solution

  • Your file is empty because your pattern does not match.

    I have added named capture groups for clarity, your updated pattern could be:

    ^(?P<posting_date>\d{2}/\d{2}/\d{4})\s+(?P<transaction_date>\d{2}/\d{2}/\d{4})\s+(?P<description>[\w\s(),:]+?)\s+(?P<Amount>-?\d{1,3}(?:\s\d{3})*\.\d{2})\s+(?P<Balance>-?\d{1,3}(?:\s\d{3})*\.\d{2})$
    

    The pattern matches:

    • ^ Start of string
    • (?P<posting_date>\d{2}/\d{2}/\d{4})\s+ Match a date like pattern
    • (?P<transaction_date>\d{2}/\d{2}/\d{4})\s+ Same pattern for the transaction date
    • (?P<description>[\w\s(),:]+?)\s+ Match the allowed characters, as few as possible
    • (?P<Amount>-?\d{1,3}(?:\s\d{3})*\.\d{2})\s+ Match the amount ending on . and 2 digits with a whitespace char between 3 digits and leading 1-3 digits
    • (?P<Balance>-?\d{1,3}(?:\s\d{3})*\.\d{2}) Same pattern for the balance
    • $ End of string

    See a regex demo.

    The updated part of the code:

    match = re.search(
        r"(?P<posting_date>\d{2}/\d{2}/\d{4})\s+(?P<transaction_date>\d{2}/\d{2}/\d{4})\s+(?P<description>[\w\s(),:]+?)\s+(?P<Amount>-?\d{1,3}(?:\s\d{3})*\.\d{2})\s+(?P<Balance>-?\d{1,3}(?:\s\d{3})*\.\d{2})$",
        transaction,
    )
    

    The content of the transactions.csv file:

    Posting Date,Transaction Date,Description,Amount,Balance
    26/01/2023,26/01/2023,Payment Received Z Kona,8 000.00,8 085.87
    26/01/2023,26/01/2023,Banking App Payment: Sihle,-2 000.00,6 085.87
    26/01/2023,26/01/2023,Payment Fee,-1.50,6 084.37
    26/01/2023,26/01/2023,SMS Payment Notification Fee,-0.25,6 084.12
    26/01/2023,26/01/2023,Payment Received Z Kona,15 000.00,21 084.12
    26/01/2023,26/01/2023,Payment Received Z Kona,1 500.00,22 584.12
    26/01/2023,26/01/2023,Payment Received Z Kona,2 000.00,24 584.12
    26/01/2023,26/01/2023,Banking App Transfer to Ms K Savings (1816578655),-18 500.00,6 084.12