algorithm artificial-intelligence cluster-analysis classification similarity

How to categorize without using classification or clustering algorithms?


I have a crawler program that stores sports data from 7 different news agencies every day. It stores about 1200 sports news items per day. I want to group the news from the last two days into sub-categories. So every two days I have about 2400 news items that belong to exactly those days, and many of them cover exactly the same event. For example:

70 news items are about Brad Keselowski's 500-mile race.

120 news items are about the US swimmer Nyad beginning a swim.

28 news items are about the match between Man United and Man City.

. . .

In other words, I want to make something like Google News.

The problem is that this is not a classification problem, because I don't have predefined classes. For example, my classes are not swimming, golf, football, etc. My classes are the specific events in each field that happened during these two days. So I cannot use classification algorithms such as Naive Bayes.

On the other hand, clustering algorithms do not solve my problem either, because I don't want to force the news into n clusters. One news item may have no similar news at all, or one two-day batch may contain 12 different stories while another contains 30. So I cannot use clustering algorithms such as "Single Link (Maximum Similarity)", "Complete Link (Minimum Similarity)", "Maximum Weighted Matching" or "Group Average (Average Intra Similarity)".

I have some ideas of my own. For example, any two news items that share 10 words should be in the same class. But unless we take into account parameters such as document length and the influence of common versus rare words, among other things, this will not work well.
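To make that idea concrete, here is a minimal sketch of one way it could be adjusted, not a known or recommended algorithm. It assumes TF-IDF weighting (so rare words count more than common ones), cosine similarity on length-normalised vectors (so long and short items are comparable), and a hypothetical threshold of 0.3 that would need tuning on real data. Items whose similarity reaches the threshold are linked, and linked items are merged with a union-find structure. In effect this is a single-link grouping cut at a similarity threshold, so the number of groups is not fixed in advance and an item with no similar news stays alone. The names tokenize, tfidf_vectors, cosine and group_by_threshold are made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (dict: term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed IDF: a term present in every document gets weight 0.
        vec = {t: tf * math.log((1 + n) / (1 + df[t])) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def group_by_threshold(docs, threshold=0.3):
    """Group document indices; the threshold value is an assumption to be tuned."""
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is O(n^2); acceptable for a few thousand items.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling group_by_threshold(list_of_news_texts) returns lists of indices into the input, one list per story. Because linking is transitive, a chain of loosely related items can end up in one group, which is a known weakness of this kind of single-link grouping.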

I have read this paper, but it did not answer my question.

Is there any known algorithm to solve this problem?


Solution

  • The problem strikes me as a clustering problem with an unknown quality measure for the clusters. That points to an unsupervised method, which is ultimately based on detecting correlations using redundancy in the data. Perhaps something like principal component analysis or latent semantic analysis could be useful. The different dimensions (principal components or singular vectors) would indicate distinct major themes, with the terms corresponding to the vector components hopefully being the words appearing in the description. One drawback is that there's no guarantee that the strongest correlations would lead easily to a sensible description.
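As a rough illustration of the latent semantic analysis suggestion, the sketch below builds a raw term-document count matrix and takes its singular value decomposition with NumPy. It is only a sketch under stated assumptions: raw counts are used where a real system would probably use TF-IDF weights, the number of components k is chosen by hand, and the function names term_document_matrix and lsa_themes are made up for this example. Each left singular vector is read as a "theme", described by the terms with the largest absolute loadings, and each news item gets coordinates along those themes.

```python
import re

import numpy as np


def term_document_matrix(docs):
    """Build a (terms x documents) matrix of raw term counts."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            X[index[t], j] += 1.0
    return X, vocab


def lsa_themes(docs, k=2, top_terms=3):
    """Return the top terms of the k strongest themes and per-document coordinates.

    k and top_terms are hand-picked assumptions, not values from the answer.
    """
    X, vocab = term_document_matrix(docs)
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    themes = []
    for c in range(k):
        # Terms with the largest absolute loading on singular vector c.
        order = np.argsort(-np.abs(U[:, c]))[:top_terms]
        themes.append([vocab[i] for i in order])
    # Row d holds document d's coordinates along the k themes.
    doc_coords = (s[:k, None] * Vt[:k]).T
    return themes, doc_coords
```

News items that load mainly on the same component would be candidates for the same story. As the answer warns, nothing guarantees that the strongest components correspond to readable, distinct events, so the top terms need to be inspected before being used as descriptions.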