Search code examples
nlptext-analysistopic-modelingunsupervised-learningcategorization

Find the common topic(s) of a set of related Wikipedia articles.


I have an unlabelled data-set consisting of thousands of Wikipedia articles.
These articles are grouped into sets of articles that are closely related in terms of their content.
Given one of these sets, I want to determine the common topic(s) that all of its articles belong to.

Example:
Given the following set of related articles by their title:

{Calculus, matrices, number theory}

I can determine that a common topic is mathematics.

Is there a simple way to do this programmatically by analysing the text of each article?
It doesn’t need to be super accurate and precise.
If this is not possible, a list of words that most accurately represent the set of related articles should suffice.


Solution

  • A standard way to assign cluster labels is to sort (in descending order) terms in these articles by their tf-idf scores, and then report the top three as the most likely descriptive words for that cluster.

    More precisely, you can use the following tf-idf term score, where tf(t, C) is the weight of term 't' in cluster 'C'.

    score(t, C) = log (1 + \lambda/(1-\lambda) * tf(t, C)/\sum_{t' in C} tf(t', C) * cs/cf(t))
    

    Here, tf(t, C)/\sum_{t' in C} tf(t', C) simply denotes the maximum likelihood of sampling term t from cluster C, and cs/cf(t) denotes the ratio of the collection size to the collection frequency of term 't' (note that if t is relatively uncommon in other clusters, this value is high because of low cf(t)).

    Thus, the more frequent a term is in this cluster (probably 'mathematics' be a term common in all the documents of your example cluster), and the more this term is uncommon in the rest of the clusters (the term 'mathematics' is likely to be rare in other ones), this term is likely to be chosen as a representative term as the cluster label.

    You can use lambda to control the relative importance that you might want to associate with the term frequency component; a good choice of lambda is 0.6.