What is the correct way to programmatically detect and correct the Content-Type charset in an email header in Python?
I have thousands of emails extracted to .eml (basically plain text) files. Some are encoded as shift_jis, but the Content-Type header doesn't declare that charset, so they don't display correctly in any email program. Manually adding the charset to the Content-Type header corrects this.
Was:
Content-Type: text/plain; format=flowed
Needs to be:
Content-Type: text/plain; charset="shift_jis"; format=flowed
What's the correct way to do this in Python while preserving the email body and the other parts of the header?
Also, is there a way to detect which encoding a message uses, and only correct those with that encoding? I can't just convert all of them blindly, since some are iso_2022_jp, and those are already displaying correctly.
With get_charsets() you can get the charset each part of a message already declares (None where no charset is set). Here's a sample:
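from email import message_from_binary_file
# Open in binary mode: the body may not be valid in the default text encoding.
with open('path.eml', 'rb') as f:
    msg = message_from_binary_file(f)
msg.get_charsets()
[None, 'gb2312', None]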
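With this approach you can loop through all the messages and, for each text part that reports None, set the correct charset with set_charset(). Be aware that set_charset() can do more than edit the header: it may add MIME headers and convert the payload to the charset's output encoding. If only the Content-Type parameter should change and the body must stay byte-for-byte intact, set_param('charset', ...) is the more conservative call.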
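Putting the loop together, here is a minimal sketch rather than a definitive implementation. The function names (looks_like_shift_jis, add_missing_charset, fix_file) are made up for illustration, and the detection is only a heuristic: it assumes a body containing non-ASCII bytes that is not valid UTF-8 but decodes cleanly as Shift_JIS really is Shift_JIS. ISO-2022-JP is a 7-bit encoding, so pure-ASCII bodies are left alone. The sketch uses set_param() instead of set_charset() so that only the header changes.

```python
from email import message_from_bytes
from pathlib import Path


def looks_like_shift_jis(data):
    """Heuristic guess: non-ASCII, not valid UTF-8, but decodes as Shift_JIS."""
    if data.isascii():
        return False  # pure 7-bit text, which includes ISO-2022-JP
    try:
        data.decode("utf-8")
        return False  # valid UTF-8 is unlikely to be Shift_JIS
    except UnicodeDecodeError:
        pass
    try:
        data.decode("shift_jis")
        return True
    except UnicodeDecodeError:
        return False


def add_missing_charset(raw):
    """Return message bytes with charset="shift_jis" added where it seems missing.

    Returns the original object unchanged if nothing needed fixing.
    """
    msg = message_from_bytes(raw)
    changed = False
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        if part.get_content_charset() is not None:
            continue  # a charset is already declared; leave it alone
        body = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if body and looks_like_shift_jis(body):
            # replace=True keeps the header in its original position;
            # it requires that a Content-Type header already exists.
            part.set_param("charset", "shift_jis",
                           replace="Content-Type" in part)
            changed = True
    return msg.as_bytes() if changed else raw


def fix_file(path):
    """Rewrite one .eml file in place, only if a charset was added."""
    p = Path(path)
    raw = p.read_bytes()
    fixed = add_missing_charset(raw)
    if fixed is not raw:
        p.write_bytes(fixed)
```

The new parameter is appended after format=flowed rather than inserted before it; parameter order does not matter to mail clients. as_bytes() writes "\n" line endings by default, so work on copies of the files first and compare a few results before running it over the whole archive.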