I'm trying to analyse some Facebook Messenger data and I'm having trouble with UTF-8 encoding.
import os
import json
import datetime
from tqdm import tqdm
import csv
from datetime import datetime
directory = "facebook-100071636101603/messages/inbox"
folders = os.listdir(directory)
if ".DS_Store" in folders:
    folders.remove(".DS_Store")
for folder in tqdm(folders):
    print(folder)
    for filename in os.listdir(os.path.join(directory,folder)):
        if filename.startswith("message"):
            data = json.load(open(os.path.join(directory,folder,filename), "r"))
            for message in data["messages"]:
                try:
                    date = datetime.fromtimestamp(message["timestamp_ms"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                    sender = message["sender_name"]
                    content = message["content"]
                    with open('output.csv', 'w', encoding="utf-8") as csv_file:
                        writer = csv.writer(csv_file)
                        writer.writerow([date,sender,content])
                except KeyError:
                    pass
This script works, but the output CSV doesn't show the accented characters correctly.
I'm very new to this, so I haven't tried a lot. I've read the Python csv documentation and found this passage:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
But this doesn't seem to work.
Edit: This is the output I'm getting, but it should be Jørn and not JÃ¸rn, and quête, not quÃªte.
Try adding encoding="utf-8" to this line:
json.load(open(os.path.join(directory,folder,filename), "r", encoding="utf-8"))
This makes Python decode every file you load as UTF-8 instead of the system default encoding.
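If the characters still come out wrong after this, it may be worth checking how the JSON file itself stores them. Here is a minimal check, assuming the export writes each UTF-8 byte of a non-ASCII character as its own \u00XX escape. The sample string is made up for illustration and is not taken from your file:

```python
import json

# Hypothetical sample, shaped like the escapes such an export might contain.
raw = '{"sender_name": "J\\u00c3\\u00b8rn"}'

# The raw text is pure ASCII, so the encoding passed to open() cannot change it.
print(raw.isascii())

name = json.loads(raw)["sender_name"]

# Each escape is decoded as its own character, so the single letter
# U+00F8 shows up as the two code points U+00C3 and U+00B8.
print([hex(ord(ch)) for ch in name])
```

If you see two code points where you expected one, the problem is in the stored data rather than in how the file is opened.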
EDIT:
You need to install ftfy using pip install ftfy. This package will fix your broken encoding.
Change sender and content to fix the encoding using ftfy by writing this:
import ftfy
# Your other code
sender = message["sender_name"]
content = message["content"]
sender = ftfy.fix_text(sender)
content = ftfy.fix_text(content)
You can use ftfy.fix_text(string) for any other broken encoding as well.
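One more thing, separate from the encoding: opening output.csv in "w" mode inside the loop truncates the file for every message, so only the last row would remain. Below is a rough sketch of the whole flow with the file opened once. To keep it dependency-free it uses a standard-library round-trip in place of ftfy.fix_text, which only covers the specific case of UTF-8 bytes decoded as Latin-1 (ftfy handles more). The names fix_mojibake and write_messages and the sample data are hypothetical:

```python
import csv
import io
from datetime import datetime


def fix_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave it unchanged


def write_messages(messages, csv_file):
    """Write one row per message to an already-open text file."""
    writer = csv.writer(csv_file)
    for message in messages:
        try:
            date = datetime.fromtimestamp(message["timestamp_ms"] / 1000)
            row = [
                date.strftime("%Y-%m-%d %H:%M:%S"),
                fix_mojibake(message["sender_name"]),
                fix_mojibake(message["content"]),
            ]
        except KeyError:
            continue  # skip messages that lack one of the fields
        writer.writerow(row)


# Made-up sample standing in for data["messages"].
sample = [
    {"timestamp_ms": 1600000000000,
     "sender_name": "J\u00c3\u00b8rn",
     "content": "qu\u00c3\u00aate"},
    {"timestamp_ms": 1600000001000,
     "sender_name": "J\u00c3\u00b8rn"},  # no "content" key
]

# In the real script this would be opened once, before the loops:
# open("output.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_messages(sample, buffer)
print(buffer.getvalue())
```

In your script, the with open(...) block would wrap the outer folder loop, and the same writer would be reused for every message.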