Tags: algorithm, data-structures, cluster-analysis

How to cluster large datasets


I have a very large dataset of documents (500 million) and want to cluster all of them according to their content.

What would be the best way to approach this? I tried k-means, but it does not seem suitable because it needs all documents at once in order to do the calculations.

Are there any clustering algorithms suitable for datasets this large?

For reference: I am using Elasticsearch to store my data.


Solution

  • According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are:

    • Combination of k-means and agglomerative clustering (bottom-up)
    • Topic modeling
    • Co-clustering

    But I can't tell you how to apply these to your dataset. It's big - good luck.
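
    The answer does not prescribe a library, but a minimal sketch of the first combination (over-cluster with k-means, then merge the centroids bottom-up) could look like the following, assuming scikit-learn, TF-IDF features, and a made-up toy corpus:

    ```python
    # Minimal sketch of "k-means + agglomerative (bottom-up)", assuming scikit-learn;
    # the vectorizer, cluster counts, and corpus are illustrative, not from the answer.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

    docs = ["storage engines and indexing", "index shards in elasticsearch",
            "k-means on text vectors", "agglomerative merging of centroids"]

    # Step 1: vectorize and over-cluster with (mini-batch) k-means.
    X = TfidfVectorizer().fit_transform(docs)
    km = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3).fit(X)

    # Step 2: merge the k-means centroids bottom-up into the final clusters.
    agg = AgglomerativeClustering(n_clusters=2).fit(km.cluster_centers_)

    # Each document inherits the merged label of its k-means centroid.
    print(agg.labels_[km.labels_])
    ```

    On 500 million documents, the k-means step would be run in mini-batches (or in a distributed framework) with a much larger number of clusters, and only the resulting centroids would be merged hierarchically, which is what keeps the agglomerative step tractable.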

    For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). This guy is the developer of the tm package for R, which does text mining via document-term matrices.
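
    tm is an R package, so the snippet below is only a rough Python analogue of the document-term matrix such tools build (one row per document, one column per term); the toy corpus is made up:

    ```python
    # Rough Python analogue of a document-term matrix (tm itself is an R package).
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog sat", "the dog barked"]  # toy corpus
    vec = CountVectorizer()
    dtm = vec.fit_transform(docs)            # sparse matrix: documents x terms

    print(vec.get_feature_names_out())       # term (column) labels
    print(dtm.toarray())                     # raw term counts per document
    ```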

    The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.

    The process contains lots of intermediate steps of manual inspection.
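
    As a rough illustration of that case-study pipeline (cluster a manageable sample with k-means, then train an SVM on the resulting labels so further documents can be assigned without re-clustering), here is a hedged scikit-learn sketch; the thesis itself works in R with tm and its own data, so the library, corpus, and parameters below are assumptions:

    ```python
    # Sketch of "k-means, then a Support Vector Machine classifier", assuming
    # scikit-learn; corpus and parameters are made up for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    sample = ["court ruling on appeal", "mailing list archive thread",
              "supreme court decision", "reply to the mailing list"]

    vec = TfidfVectorizer().fit(sample)
    X = vec.transform(sample)

    # Cluster only a sample of the corpus ...
    labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

    # ... then learn a classifier on the cluster labels so the remaining
    # documents can be assigned one batch at a time, without re-clustering.
    clf = LinearSVC().fit(X, labels)
    print(clf.predict(vec.transform(["new mailing list message"])))
    ```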