
Machine Learning Email Prioritization - Python


I have been working on a priority email inbox written in Python, with the ultimate aim of using a machine learning algorithm to label (or classify) a selection of emails as either important or unimportant. I will begin with some background information and then move on to my question.

I have so far developed code to extract data from an email and process it to discover the most important ones. This is achieved using the following email features:

  • Sender's Address Frequency
  • Thread Activity
  • Date Received (time between replies)
  • Common Words in body/subject

My current code assigns each email a rank (or weighting) between 0.1 and 1 based on its importance, and then applies a label of either 'important' or 'unimportant' (stored as 1 or 0). An email is given priority status if its rank is greater than 0.5. This data is stored in a CSV file (as below).

     From           Subject       Body        Date          Rank    Priority 
     test@test.com  HelloWorld    Body Words  10/10/2012    0.67    1
     rest@test.com  ByeWorld      Body Words  10/10/2012    0.21    0
     best@test.com  SayWorld      Body Words  10/10/2012    0.91    1
     just@test.com  HeyWorld      Body Words  10/10/2012    0.48    0
     etc        …………………………………………………………………………

I have two sets of email data (one for training, one for testing). The above applies to my training email data. I am now attempting to train a learning algorithm so that I can predict the importance of the testing data.

To do this I have been looking at both scikit-learn and NLTK. However, I am having trouble taking what I have learnt in the tutorials and applying it to my project. I have no particular requirements regarding which learning algorithm is used. Is this as simple as applying the following? And if so, how?

   X, y = email.data, email.target

   from sklearn.svm import LinearSVC
   clf = LinearSVC()

   clf = clf.fit(X, y)

   X_new = [Testing Email Data]

   clf.predict(X_new)

Solution

  • The easiest (though probably not the fastest) solution(*) is to use scikit-learn's DictVectorizer. First, read in each sample with Python's csv module, and build a dict containing (feature, value) pairs, while keeping the priority separate:

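    # UNTESTED CODE, may contain a bug or two; also, you need to decide how to
    # implement split_words
    import csv
    from sklearn.feature_extraction import DictVectorizer

    dicts = []
    y = []

    # "train.csv" is a placeholder file name; use your own training file
    with open("train.csv", newline="") as csvfile:
        datareader = csv.reader(csvfile)
        next(datareader)  # skip the header row
        for row in datareader:
            y.append(int(row[-1]))  # Priority is the last column
            d = {"From": row[0]}
            for word in split_words(row[1]):
                d["Subject_" + word] = 1
            for word in split_words(row[2]):
                d["Body_" + word] = 1
            # etc.
            dicts.append(d)

    # vectorize!
    vectorizer = DictVectorizer()
    X_train = vectorizer.fit_transform(dicts)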
    

    You now have a sparse matrix X_train that, together with y, you can feed to a scikit-learn classifier.
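To make the step from feature dicts to a trained classifier concrete, here is a minimal sketch. The rows are made up for illustration, `split_words` is a hypothetical regex tokenizer (the answer leaves its implementation to you), and `row_to_dict` is just a helper name chosen here; `LinearSVC` is taken from the question, and any other scikit-learn classifier that accepts sparse input could be swapped in.

```python
import re

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC


def split_words(text):
    """Hypothetical tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def row_to_dict(sender, subject, body):
    """Build the (feature, value) dict for one email (helper name is an assumption)."""
    d = {"From": sender}
    for word in split_words(subject):
        d["Subject_" + word] = 1
    for word in split_words(body):
        d["Body_" + word] = 1
    return d


# Made-up training rows: (sender, subject, body, priority).
train_rows = [
    ("boss@test.com", "Urgent report", "please send the report today", 1),
    ("news@test.com", "Weekly digest", "top stories this week", 0),
    ("boss@test.com", "Meeting moved", "the meeting is now at noon", 1),
    ("news@test.com", "Special offer", "discount on all items", 0),
]

dicts = [row_to_dict(s, subj, body) for s, subj, body, _ in train_rows]
y = [label for *_, label in train_rows]

# String values such as the sender are one-hot encoded ("From=boss@test.com");
# numeric values are used as-is.
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(dicts)

clf = LinearSVC()
clf.fit(X_train, y)
```

Four emails are far too few to learn anything useful; the point is only the shape of the pipeline: dicts in, sparse matrix out, then `fit`.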

    Be aware:

    1. When you want to make predictions on unseen data, you must apply the same procedure and the exact same vectorizer object to it. I.e. you have to build a test_dicts object using the loop above, then do X_test = vectorizer.transform(test_dicts).

    2. I've assumed you want to predict the priority directly. Predicting the "rank" instead would be a regression problem rather than a classification one. Some scikit-learn classifiers have a predict_proba method, which produces the probability that an email is important, but you can't train those classifiers on the ranks.
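The two caveats above can be sketched together as follows. The feature dicts are invented for illustration, and `LogisticRegression` is used here only as one example of a classifier that offers `predict_proba`; it is not necessarily the best choice for your data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up feature dicts, as the loop in the answer would produce them.
train_dicts = [
    {"From": "boss@test.com", "Subject_urgent": 1, "Body_report": 1},
    {"From": "news@test.com", "Subject_digest": 1, "Body_stories": 1},
    {"From": "boss@test.com", "Subject_meeting": 1, "Body_noon": 1},
    {"From": "news@test.com", "Subject_offer": 1, "Body_discount": 1},
]
y_train = [1, 0, 1, 0]  # the Priority column, not the Rank column

test_dicts = [
    # "Body_budget" never occurred in training
    {"From": "boss@test.com", "Subject_urgent": 1, "Body_budget": 1},
]

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_dicts)

# Reuse the fitted vectorizer: transform, not fit_transform. Features that
# were not seen during fitting are silently dropped, so the columns of
# X_test line up with those of X_train.
X_test = vectorizer.transform(test_dicts)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Column of predict_proba that corresponds to class 1 ("important").
important_col = list(clf.classes_).index(1)
proba_important = clf.predict_proba(X_test)[:, important_col]
```

Calling `fit_transform` on the test dicts instead would build a new vocabulary with different columns, and the classifier's predictions would be meaningless or fail outright on a shape mismatch.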

    (*) I am the author of scikit-learn's DictVectorizer, so this is not unbiased advice. It is from the horse's mouth, though :)