Reading .eml files with Python 3.6 using emaildata 0.3.4


I am using Python 3.6.1 and I want to read in email files (.eml) for processing. I am using the emaildata 0.3.4 package, but whenever I try to import the Text class as shown in the documentation, I get a module error:

import email
from email.text import Text
>>> ModuleNotFoundError: No module named 'cStringIO'

When I tried to fix this using this update, I got another error, this time relating to mimetools:

>>> ModuleNotFoundError: No module named 'mimetools'

Is it possible to use emaildata 0.3.4 with Python 3.6 to parse .eml files? Or are there any other packages I can use to parse .eml files? Thanks.


Solution

  • The standard-library email package can read .eml files. Open the file in binary mode and parse it with the BytesParser class, using policy.default so that the result is a modern EmailMessage object. Then call get_body() with 'plain' in its preferencelist to select the plain-text part, and get_content() on that part to get the text of the email.

    import glob
    from email import policy
    from email.parser import BytesParser

    file_list = glob.glob('*.eml')  # list of .eml files in the current directory
    with open(file_list[2], 'rb') as fp:  # select a specific email file from the list
        msg = BytesParser(policy=policy.default).parse(fp)
    # preferencelist is meant to be a tuple, hence the trailing comma
    text = msg.get_body(preferencelist=('plain',)).get_content()
    print(text)  # print the email content
    # Output:
    # Hi,
    # This is an email
    # Regards,
    # Mister. E
    

    Granted, this is a simplified example: it does not handle HTML bodies or attachments. But it does essentially what the question asks for, and what I wanted to do.

    Here is how you would iterate over several emails and save each as a plain text file:

    import os  # needed for os.path.splitext; other imports as in the first example

    file_list = glob.glob('*.eml')  # returns list of files
    for file in file_list:
        with open(file, 'rb') as fp:
            msg = BytesParser(policy=policy.default).parse(fp)
        fnm = os.path.splitext(file)[0] + '.txt'  # same base name, .txt extension
        txt = msg.get_body(preferencelist=('plain',)).get_content()
        with open(fnm, 'w') as f:
            print(txt, file=f)  # write the plain-text body to the output file
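
The examples above assume every message has a plain-text part. As a rough sketch of how HTML-only messages and attachments might be handled, the snippet below falls back to the HTML part and lists attachment filenames. The helper names extract_text and attachment_names are made up for illustration, and the demo message is built in memory rather than read from a real .eml file.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def extract_text(msg):
    """Return the plain-text body, else the HTML body, else ''. (Hypothetical helper.)"""
    # get_body() returns None when no part matches the preference list,
    # so guard before calling get_content().
    body = msg.get_body(preferencelist=('plain', 'html'))
    return body.get_content() if body is not None else ''


def attachment_names(msg):
    """Return the filenames of all attachments (None for unnamed ones). (Hypothetical helper.)"""
    return [part.get_filename() for part in msg.iter_attachments()]


# Build a small message in memory so the sketch runs without any .eml file.
demo = EmailMessage()
demo['Subject'] = 'Demo'
demo.set_content('Hi,\nThis is an email\n')
demo.add_attachment(b'col1,col2\n', maintype='application',
                    subtype='octet-stream', filename='data.csv')

# Round-trip through bytes, as if the message had been read from disk.
parsed = BytesParser(policy=policy.default).parsebytes(demo.as_bytes())
print(extract_text(parsed))
print(attachment_names(parsed))
```

In the batch loop, calling a helper like extract_text(msg) instead of chaining get_body(...).get_content() directly would avoid an AttributeError on messages that have no text part at all.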