
Merge emails in text format into one csv file for machine learning


I am using the Enron dataset for a machine learning problem. I want to merge all the spam files into a single csv file and all the ham files into another single csv for further analysis. I'm using the dataset listed here: https://github.com/crossedbanana/Enron-Email-Classification

I used the code below to merge the emails, and the merge itself works. However, when I try to load the resulting csv file into pandas, I get the error "ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2".

Code to merge the email txt files into a csv:

import glob
import os

# append the raw text of every spam file to one output file
for f in glob.glob("./dataset_temp/spam/*.txt"):
    os.system("cat "+f+" >> OutFile1.csv")

Code to load into pandas:

```
import pandas as pd

# reading the csv into pandas
emails = pd.read_csv('OutFile1.csv')
print(emails.shape)
```

1. How can I get rid of the parser error? I think it occurs because of the commas present in the email messages.
2. How can I just load each email message into pandas with just the email body?

This is what the email format looks like (an example of a text file in the spam folder).
The commas in line 3 cause a problem when loading into pandas.


*Subject: your prescription is ready . . oxwq s f e
low cost prescription medications
soma , ultram , adipex , vicodin many more
prescribed online and shipped
overnight to your door ! !
one of our us licensed physicians will write an
fda approved prescription for you and ship your
order overnight via a us licensed pharmacy direct
to your doorstep . . . . fast and secure ! !
click here !
no thanks , please take me off your list
ogrg z
lqlokeolnq
lnu* 


Thanks for any help. 

Solution

  • I solved my problem this way: read all the txt files into a list first, then load that list into a dataframe instead of going through a csv file.

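    ```
    import codecs
    import glob

    from pandas import DataFrame

    BASE_DIR = './'
    SPAM_DIR = 'spam/'
    HAM_DIR = 'ham/'  # assumed folder name; not defined in the original snippet

    def load_text_file(filenames):
        # read each file as one string, skipping undecodable bytes
        text_list = []
        for filename in filenames:
            with codecs.open(filename, "r", "utf-8", errors='ignore') as f:
                text = f.read().replace('\r\n', ' ')
                text_list.append(text)
        return text_list

    # collect the filenames and read them into a list
    ham_filenames = glob.glob(BASE_DIR + HAM_DIR + '*.txt')
    ham_list = load_text_file(ham_filenames)

    # load the list into a dataframe (the original passed an undefined train_list here)
    df = DataFrame(ham_list, columns=['emails'])
    ```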
    

    Once I had it in a dataframe, I just parsed the emails into subject and body. Thanks to everyone for their help.
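
The subject/body parsing step is not shown above, so here is a minimal sketch of how it might look. It assumes each email starts with a `Subject:` line, as in the sample in the question, and that the line break after the subject is still present when the split runs (that is, the split happens before line breaks are replaced with spaces). The helper `split_subject_body` and the two sample emails are made up for illustration. The sketch also round-trips the result through `to_csv` and `read_csv`: because pandas quotes fields that contain commas or newlines when writing, the tokenizing error from the question should not come up with a file produced this way.

```python
import io

import pandas as pd


def split_subject_body(text):
    """Split one raw email into (subject, body).

    Assumes the first line starts with 'Subject:'. If it does not,
    the subject is left empty and the whole text becomes the body.
    """
    first_line, _, rest = text.partition("\n")
    if first_line.lower().startswith("subject:"):
        subject = first_line[len("subject:"):].strip()
        return subject, rest.strip()
    return "", text.strip()


# Two made-up emails standing in for the contents of the text files.
raw_emails = [
    "Subject: your prescription is ready\nsoma , ultram , adipex\nclick here !",
    "Subject: meeting notes\nsee attached , thanks",
]

rows = [split_subject_body(text) for text in raw_emails]
df = pd.DataFrame(rows, columns=["subject", "body"])

# to_csv quotes fields that contain commas or newlines,
# so read_csv can parse them back as single fields.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.shape)  # (2, 2)
```

To write a real file instead of an in-memory buffer, pass a path such as `'spam.csv'` to `to_csv` and read it back with `pd.read_csv('spam.csv')`.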