I'm trying to build an algorithm capable of predicting if I will like an article, based on the previous articles I liked.
Example:
I found a lead here: Python: tf-idf-cosine: to find document similarity
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
And then, to compare the first document of the dataset to the other documents in the dataset:
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
        0.04457106,  0.03293218])
For my case, what I think I will do is concatenate the text of my 10 articles, run the TfidfVectorizer, and then compare the resulting big vector to each new incoming article.
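Something like this minimal sketch is what I have in mind (the liked_articles and new_article variables are just placeholders for my data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Placeholder data: the 10 articles I liked and one incoming article
liked_articles = ["text of the first article I liked", "text of the second article I liked"]
new_article = "text of the incoming article"

# Concatenate my liked articles into one big document
big_document = " ".join(liked_articles)

# Fit the vectorizer on both texts so they share the same vocabulary,
# then compare the big vector against the new article's vector
tfidf = TfidfVectorizer().fit_transform([big_document, new_article])
similarity = linear_kernel(tfidf[0:1], tfidf[1:2]).flatten()[0]
print(similarity)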
But I wonder how the comparison will be done: I don't know if you get my point, but in that case 90% of the words in the big vector won't be in the little one.
So my question is: how is the cosine similarity calculated? Do you see a better approach for my project?
A naive Bayes classifier should perform better. Your problem is similar to the classic spam-classification problem, except that in your case you are not identifying spam (articles you don't like) but ham (articles you like).
From the first 50 labeled articles, it's easy to calculate the following statistics (a small code sketch follows the list):
p(word1|like)   -- among all the articles I like, the probability that word1 appears
p(word2|like)   -- among all the articles I like, the probability that word2 appears
...
p(wordn|like)   -- among all the articles I like, the probability that wordn appears
p(word1|unlike) -- among all the articles I do not like, the probability that word1 appears
...
p(like)         -- the proportion of articles I like (should be 0.2 in your example)
p(unlike)       -- the proportion of articles I do not like (0.8)
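Here is a minimal sketch of how those statistics could be collected; the training lists are hypothetical stand-ins for your 50 labeled, tokenized articles, and the add-one smoothing is my addition so that a word seen in only one class does not get probability zero:

from collections import Counter

# Hypothetical tokenized training data standing in for the 50 labeled articles
liked = [["great", "python", "tutorial"], ["python", "machine", "learning"]]
disliked = [["celebrity", "gossip", "news"], ["sports", "scores"]]

# Raw word counts per class
like_counts = Counter(w for article in liked for w in article)
unlike_counts = Counter(w for article in disliked for w in article)
vocab = set(like_counts) | set(unlike_counts)

def p_word(word, counts):
    # p(word | class), with add-one smoothing (my addition, not in the text above)
    return (counts[word] + 1.0) / (sum(counts.values()) + len(vocab))

# Class priors: p(like) and p(unlike) -- 0.2 and 0.8 with 10 liked out of 50
p_like = len(liked) / (len(liked) + len(disliked))
p_unlike = 1.0 - p_like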
Then, given the 51st new example, you find all the seen words in it; say it contains only word2 and word5. One of the nice things about naive Bayes is that it only cares about in-vocabulary words. Even if more than 90% of the words in the big vector do not appear in the new one, that is not a problem, since all the irrelevant features cancel each other out without affecting the result.
The likelihood ratio will be

    prob(like | 51st article)      p(like) x p(word2|like) x p(word5|like)
    ---------------------------  =  ----------------------------------------------
    prob(unlike | 51st article)     p(unlike) x p(word2|unlike) x p(word5|unlike)
As long as the ratio is > 1, you can predict the article as "like". Further, if you want to increase the precision of identifying "liked" articles, you can play with the precision-recall trade-off by increasing the threshold ratio from 1.0 to a bigger value. In the other direction, if you want to increase recall, you can lower the threshold, etc.
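A minimal sketch of that decision rule, repeating the hypothetical setup from the previous snippet so it runs on its own, and working in log space to avoid numerical underflow:

import math
from collections import Counter

# Same hypothetical data and counts as in the previous sketch
liked = [["great", "python", "tutorial"], ["python", "machine", "learning"]]
disliked = [["celebrity", "gossip", "news"], ["sports", "scores"]]
like_counts = Counter(w for article in liked for w in article)
unlike_counts = Counter(w for article in disliked for w in article)
vocab = set(like_counts) | set(unlike_counts)
p_like = len(liked) / (len(liked) + len(disliked))
p_unlike = 1.0 - p_like

def p_word(word, counts):
    # add-one smoothing, as in the previous sketch
    return (counts[word] + 1.0) / (sum(counts.values()) + len(vocab))

def log_likelihood_ratio(article_words):
    # Only in-vocabulary words contribute; unseen words are simply ignored
    seen = [w for w in article_words if w in vocab]
    ratio = math.log(p_like) - math.log(p_unlike)
    for w in seen:
        ratio += math.log(p_word(w, like_counts)) - math.log(p_word(w, unlike_counts))
    return ratio

# Predict "like" when the ratio exceeds the threshold; raise the threshold
# for more precision, lower it for more recall
threshold = 1.0
new_article = ["python", "tutorial", "gossip", "unseenword"]
print("like" if log_likelihood_ratio(new_article) > math.log(threshold) else "unlike")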
For further reading on naive Bayes classification in the text domain, see here.
This algorithm can easily be modified to do online learning, i.e., updating the learned model as soon as a new example is "liked" or "disliked" by the user, since everything in the above stats table is basically a normalized count. As long as you keep the per-word counts and the total counts saved, you are able to update the model on a per-instance basis.
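For example, a sketch of an online-updatable model that just keeps the raw counts (all names here are mine, not from any library):

from collections import Counter

class OnlineCounts:
    """Keeps the raw per-word and per-class counts so the model can be
    updated one article at a time, without retraining from scratch."""

    def __init__(self):
        self.word_counts = {"like": Counter(), "unlike": Counter()}
        self.doc_counts = {"like": 0, "unlike": 0}

    def update(self, article_words, label):
        # Fold one newly labeled article into the running counts
        self.word_counts[label].update(article_words)
        self.doc_counts[label] += 1

    def prior(self, label):
        # p(label), recomputed on demand from the stored counts
        total = sum(self.doc_counts.values())
        return self.doc_counts[label] / total if total else 0.0

model = OnlineCounts()
model.update(["python", "tips"], "like")
model.update(["celebrity", "gossip"], "unlike")
print(model.prior("like"))  # 0.5 after one liked and one disliked article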
To use the tf-idf weight of a word for naive Bayes, treat the weight as the count of the word: without tf-idf, each occurrence of a word in a document counts as 1; with tf-idf, it counts as its tf-idf weight. Then you get your probabilities for naive Bayes using the same formula. This idea can be found in this paper. I think the multinomial naive Bayes classifier in scikit-learn should accept tf-idf weights as input data.
See the documentation for MultinomialNB:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
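For instance, a minimal sketch feeding tf-idf features into MultinomialNB via a scikit-learn pipeline; the two training texts and labels are placeholders for your 50 labeled articles:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder labeled data: article texts with 1 = liked, 0 = not liked
texts = ["a great python tutorial about machine learning",
         "celebrity gossip and sports scores"]
labels = [1, 0]

# TfidfVectorizer produces fractional tf-idf "counts", which MultinomialNB
# accepts in practice, as noted in its documentation
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["another python tutorial"]))  # e.g. array([1])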