python machine-learning regression ranking

Ranking a list of emails by priority

I am trying to write a simple email ranking program (something like a priority inbox) in Python that ranks emails by how frequently their senders write to me. For example, I would split the data 50/50: on the training half, count how many emails each sender has sent; then rank the test half using those counts, so that an email from a sender who sends lots of messages is ranked highly.

I have written some Python code to take emails and extract the 'From' address from each. I have placed this information in a list which shows the most common email senders (example snippet from this list below).

 # (email address, number of emails received from this sender)
 ('tester1@csmining.org', 244)
 ('tester2@csmining.org', 162)
 ('tester3@csmining.org', 154)
 ('tester4@csmining.org', 75)
 ('tester5@csmining.org', 50)
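
For reference, the frequency-based baseline described above could be sketched roughly as follows. This is only a minimal sketch: the function names `sender_counts` and `rank_by_sender_frequency` and the sample headers are hypothetical, and it assumes the raw `From` header values have already been extracted from both halves of the data.

```python
from collections import Counter
from email.utils import parseaddr


def _address(from_header):
    """Return the bare, lower-cased address from a From header value."""
    return parseaddr(from_header)[1].lower()


def sender_counts(from_headers):
    """Count how often each sender address appears in the training set."""
    return Counter(_address(h) for h in from_headers)


def rank_by_sender_frequency(train_from, test_from):
    """Sort test emails so that frequent training senders come first.

    Senders never seen in training get a count of 0 and end up last.
    """
    counts = sender_counts(train_from)
    scored = [(_address(h), counts[_address(h)]) for h in test_from]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, for illustration only.
train = [
    "Tester One <tester1@csmining.org>",
    "tester1@csmining.org",
    "tester2@csmining.org",
]
test = ["tester2@csmining.org", "new@example.com", "tester1@csmining.org"]
ranked = rank_by_sender_frequency(train, test)
```

With this sample data, `tester1@csmining.org` is ranked first, and the unseen sender last.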

I am aware that a number of machine learning algorithms could be used to train and test on my data to do what I require. However, I am unsure which of them would give me the best results.

Solution

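Ranking based only on the sender is not a good idea. For example, I subscribe to email notifications for GitHub commits, and every day I receive hundreds of emails triggered by my co-workers' code commits. A purely frequency-based ranking would put those notifications at the top, even though few of them are important to me.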

This is not an easy problem; in my experience, even Gmail's Priority Inbox does not do it well. A good email priority ranking or scoring system needs good features. I suggest starting with the following features (see "The Learning Behind Gmail Priority Inbox"):

Social features. The sender or the sender's domain.
Thread features. Is this email in an active thread? What is the sequence number of this email in the thread? Who is cc'ed, if anyone?
Time features. When was this email received? If you have access to the owner's replies, you might want to keep track of how long it takes the owner to reply.
Content features. This is the bag-of-words model used in spam filtering.
Behavior features. This is how the email account owner responds to the email. Was it replied to, never read, immediately deleted, archived to a folder, or tagged? If it was replied to, you might want to do some content analysis on the reply as well. The length of the reply might also be a good feature.
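
As a rough illustration of the first four feature groups, here is a minimal sketch that turns one raw email into a sparse feature dictionary using only the standard library. The function name `extract_features`, the feature naming scheme, and the sample message are all hypothetical. Behavior features are left out, because they come from the mail client's logs rather than from the message itself.

```python
import re
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parseaddr, parsedate_to_datetime


def extract_features(raw_email):
    """Map a raw RFC 822 message to a {feature_name: value} dict."""
    msg = message_from_string(raw_email)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    in_thread = bool(msg.get("In-Reply-To") or msg.get("References"))

    features = {
        # Social features: sender and sender domain.
        "sender=" + sender: 1.0,
        "domain=" + sender.rpartition("@")[2]: 1.0,
        # Thread features: is it a reply, and how many people are cc'ed?
        "in_thread": 1.0 if in_thread else 0.0,
        "num_cc": float(len(getaddresses(msg.get_all("Cc", [])))),
    }

    # Time feature: hour of day taken from the Date header, if parseable.
    try:
        features["hour=%d" % parsedate_to_datetime(msg.get("Date")).hour] = 1.0
    except (TypeError, ValueError):
        pass

    # Content features: bag of words over a plain (non-multipart) body.
    body = "" if msg.is_multipart() else msg.get_payload()
    for word, count in Counter(re.findall(r"[a-z']+", body.lower())).items():
        features["word=" + word] = float(count)
    return features


# Hypothetical sample message, for illustration only.
raw = (
    "From: Alice <alice@example.com>\n"
    "Cc: bob@example.com, carol@example.com\n"
    "In-Reply-To: <123@example.com>\n"
    "Date: Mon, 01 Jan 2024 09:30:00 +0000\n"
    "Subject: Re: build\n"
    "\n"
    "Please review the build today\n"
)
features = extract_features(raw)
```

A dictionary of named features like this can be fed to most linear models after vectorization, or used directly by a model that stores one weight per feature name.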

As for the model, Gmail uses linear logistic regression, which keeps learning and prediction scalable.
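
To make the idea concrete, here is a minimal sketch of a linear logistic regression trained online with stochastic gradient steps, in pure Python. It is not Gmail's implementation. The class name `OnlineLogisticRegression`, the learning rate, the feature names, and the toy examples are all assumptions made for illustration. Labels are 1 for "important" and 0 for "not important".

```python
import math


class OnlineLogisticRegression:
    """Linear logistic regression over sparse {feature_name: value} dicts."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # one weight per feature name, default 0.0

    def predict(self, features):
        """Return the estimated probability that the email is important."""
        z = sum(self.weights.get(k, 0.0) * v for k, v in features.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        """Take one gradient step on the log loss for a single example."""
        error = label - self.predict(features)
        for k, v in features.items():
            self.weights[k] = (
                self.weights.get(k, 0.0) + self.learning_rate * error * v
            )


# Hypothetical toy training data: (features, label) pairs.
examples = [
    ({"sender=boss@example.com": 1.0, "in_thread": 1.0}, 1),
    ({"sender=notifications@example.com": 1.0, "in_thread": 0.0}, 0),
]
model = OnlineLogisticRegression()
for _ in range(50):
    for feats, label in examples:
        model.update(feats, label)

# Rank unseen emails by predicted importance, highest first.
inbox = [
    {"sender=notifications@example.com": 1.0, "in_thread": 0.0},
    {"sender=boss@example.com": 1.0, "in_thread": 1.0},
]
ranked = sorted(inbox, key=model.predict, reverse=True)
```

Because each update touches only the features present in one email, the model can keep learning as new mail and new user actions arrive, without retraining from scratch.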

Finally, as Gmail does, you can ask users to help improve the system by giving them an option to mark emails as important.