Search code examples
pythoncsvparsingtext-files

Parse text file of contacts and transpose lines into columns with Python


I have a file of sms messages in plain text that I want to convert to a CSV file. The format is as follows:

Sent on 1/1/2023 7:30:33 AM to Person1

Message

-----

Received on 5/20/2023 4:55:33 PM from Person1

Message

I want to loop through the text file, retrieve the lines and create a result like the one below.

Status Date Contact Message
Sent 1/1/2023 7:30:33 AM Person1 Message
Received 5/20/2023 4:55:33 PM Person1 Message

I started with the code below, but I'm pretty new to Python and can't figure out how to transpose the lines into columns.

import csv
import openpyxl

input_file = 'output.txt'
output_file = 'allmessages.csv'

wb = openpyxl.Workbook()
ws = wb.worksheets[0]

with open(input_file, 'r') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        ws.append(row)
        
wb.save(output_file)

Any suggestions would be greatly appreciated!


Solution

  • If your data is very consistent like you showed, then @Jam's regexp will probably do.

    If you find your data less than normal and the regexp failing, you'll probably want a little state machine that understands the structure of the input TXT:

    • blank lines can be skipped
    • "-----" is the message boundary, the point in the text where everything that directly precedes makes up a complete output row
    • "Sent" or "Received" are a header, and mark the beginning of a message
    • every other kind of line should be considered to be part of the message
    import csv
    import re
    
    
    def is_header(line: str) -> bool:
        if line.startswith("Sent on"):
            return True
        if line.startswith("Received on"):
            return True
        return False
    
    
    def get_header(line: str) -> list[str]:
        # you might be okay with line.strip().split('\t')
        fields = re.split(r"\s+", line.strip())
    
        # Sent on 5/20/2023 4:55:33 PM from Person1
        # 0    1  2         3       4  5    6
        status = fields[0]
        date = " ".join(fields[2:5])
        contact = fields[6]
    
        return [status, date, contact]
    
    
    def is_boundary(line: str) -> bool:
        return line.strip() == "-----"
    
    
    # Initialize rows w/CSV header
    rows = [
        ["Status", "Date", "Contact", "Message"],
    ]
    
    with open("input.txt") as f:
        header: list[str] = []
        msg = ""
    
        for line in f:
            # Skip blank lines
            if not line.strip():
                continue
    
            # Flush any data, reset
            if is_boundary(line) and header and msg:
                rows.append(header + [msg])
                header = []
                msg = ""
                continue
    
            if is_header(line):
                header = get_header(line)
                continue
    
            msg += line
    
    
    # Flush any trailing data
    if header and msg:
        rows.append(header + [msg])
    
    
    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)
    

    You'll see a lot of line.strip() because I want to preserve the whitespace when line represents a part of the message, and that's assuming you can have multi-line messages, like this:

    Sent on 1/1/2023 7:30:33 AM to Person1
    
    Message1 line1
    
    -----
    
    Received on 5/20/2023 4:55:33 PM from Person1
    
    Message2 line1
    Message2 line2
    -----
    Sent on 2/9/2023 5:20:12 AM to Person2
    
    Message1 line1
    Message1 line2
    Message1 line3
    
    -----
    Received on 7/4/2023 9:04:00 PM from Person2
    
    
    Message2 line1
    Message2 line2
    
    
    -----
    

    If your data is as normal as your example, you might like one line = line.strip() at the top of the loop. And then remove all the other .strip() calls.

    I made that input TXT especially messy to highlight how flexible the parser is. From that input, it produces this CSV:

    +----------+----------------------+---------+----------------+
    | Status   | Date                 | Contact | Message        |
    +----------+----------------------+---------+----------------+
    | Sent     | 1/1/2023 7:30:33 AM  | Person1 | Message1 line1 |
    |          |                      |         |                |
    +----------+----------------------+---------+----------------+
    | Received | 5/20/2023 4:55:33 PM | Person1 | Message2 line1 |
    |          |                      |         | Message2 line2 |
    |          |                      |         |                |
    +----------+----------------------+---------+----------------+
    | Sent     | 2/9/2023 5:20:12 AM  | Person2 | Message1 line1 |
    |          |                      |         | Message1 line2 |
    |          |                      |         | Message1 line3 |
    |          |                      |         |                |
    +----------+----------------------+---------+----------------+
    | Received | 7/4/2023 9:04:00 PM  | Person2 | Message2 line1 |
    |          |                      |         | Message2 line2 |
    |          |                      |         |                |
    +----------+----------------------+---------+----------------+