I have a file of sms messages in plain text that I want to convert to a CSV file. The format is as follows:
Sent on 1/1/2023 7:30:33 AM to Person1
Message
-----
Received on 5/20/2023 4:55:33 PM from Person1
Message
I want to loop through the text file, retrieve the lines and create a result like the one below.
Status | Date | Contact | Message |
---|---|---|---|
Sent | 1/1/2023 7:30:33 AM | Person1 | Message |
Received | 5/20/2023 4:55:33 PM | Person1 | Message |
I started with the code below, but I'm pretty new to Python and can't figure out how to transpose the lines into columns.
import csv
import openpyxl
input_file = 'output.txt'
output_file = 'allmessages.csv'
wb = openpyxl.Workbook()
ws = wb.worksheets[0]
with open(input_file, 'r') as data:
reader = csv.reader(data, delimiter='\t')
for row in reader:
ws.append(row)
wb.save(output_file)
Any suggestions would be greatly appreciated!
If your data is very consistent like you showed, then @Jam's regexp will probably do.
If you find your data less than normal and the regexp failing, you'll probably want a little state machine that understands the structure of the input TXT:
import csv
import re
def is_header(line: str) -> bool:
if line.startswith("Sent on"):
return True
if line.startswith("Received on"):
return True
return False
def get_header(line: str) -> list[str]:
# you might be okay with line.strip().split('\t')
fields = re.split(r"\s+", line.strip())
# Sent on 5/20/2023 4:55:33 PM from Person1
# 0 1 2 3 4 5 6
status = fields[0]
date = " ".join(fields[2:5])
contact = fields[6]
return [status, date, contact]
def is_boundary(line: str) -> bool:
return line.strip() == "-----"
# Initialize rows w/CSV header
rows = [
["Status", "Date", "Contact", "Message"],
]
with open("input.txt") as f:
header: list[str] = []
msg = ""
for line in f:
# Skip blank lines
if not line.strip():
continue
# Flush any data, reset
if is_boundary(line) and header and msg:
rows.append(header + [msg])
header = []
msg = ""
continue
if is_header(line):
header = get_header(line)
continue
msg += line
# Flush any trailing data
if header and msg:
rows.append(header + [msg])
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(rows)
You'll see a lot of line.strip()
because I want to preserve the whitespace when line represents a part of the message, and that's assuming you can have multi-line messages, like this:
Sent on 1/1/2023 7:30:33 AM to Person1
Message1 line1
-----
Received on 5/20/2023 4:55:33 PM from Person1
Message2 line1
Message2 line2
-----
Sent on 2/9/2023 5:20:12 AM to Person2
Message1 line1
Message1 line2
Message1 line3
-----
Received on 7/4/2023 9:04:00 PM from Person2
Message2 line1
Message2 line2
-----
If your data is as normal as your example, you might like one line = line.strip()
at the top of the loop. And then remove all the other .strip()
calls.
I made that input TXT especially messy to highlight how flexible the parser is. From that input, it produces this CSV:
+----------+----------------------+---------+----------------+
| Status | Date | Contact | Message |
+----------+----------------------+---------+----------------+
| Sent | 1/1/2023 7:30:33 AM | Person1 | Message1 line1 |
| | | | |
+----------+----------------------+---------+----------------+
| Received | 5/20/2023 4:55:33 PM | Person1 | Message2 line1 |
| | | | Message2 line2 |
| | | | |
+----------+----------------------+---------+----------------+
| Sent | 2/9/2023 5:20:12 AM | Person2 | Message1 line1 |
| | | | Message1 line2 |
| | | | Message1 line3 |
| | | | |
+----------+----------------------+---------+----------------+
| Received | 7/4/2023 9:04:00 PM | Person2 | Message2 line1 |
| | | | Message2 line2 |
| | | | |
+----------+----------------------+---------+----------------+