
Clustering huge data matrix in python?


I want to cluster 1.5 million chemical compounds. This means building a 1.5 million × 1.5 million distance matrix...

I think I can generate such a big table using PyTables, but once I have such a table, how do I cluster it?

I guess I can't just pass a PyTables object to one of scikit-learn's clustering methods...

Are there any Python-based frameworks that would take my huge table and do something useful (like clustering) with it? Perhaps in a distributed manner?


Solution

  • Maybe you should look at algorithms that don't need a full distance matrix.

    I know that it is popular to formulate algorithms as matrix operations, because tools such as R are rather fast at matrix operations (and slow at other things). But there is a whole ton of methods that don't require O(n^2) memory...
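
    As a minimal sketch of this idea: `MiniBatchKMeans` in scikit-learn clusters the raw feature matrix in small batches, so it never materializes an n × n distance matrix and memory stays O(n · d). The random feature matrix below is a hypothetical stand-in for whatever fingerprint vectors represent the compounds; the cluster count and batch size are illustrative assumptions, not recommendations.

    ```python
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Hypothetical stand-in for compound fingerprints: in the real problem this
    # would be ~1.5M rows; a small random sample keeps the sketch fast.
    rng = np.random.default_rng(0)
    X = rng.random((10_000, 64))

    # MiniBatchKMeans streams over the data in mini-batches, so peak memory is
    # proportional to n * d (the feature matrix), never n * n (a distance matrix).
    km = MiniBatchKMeans(n_clusters=50, batch_size=1024, n_init=3, random_state=0)
    labels = km.fit_predict(X)

    print(labels.shape)  # one cluster label per compound
    ```

    The same pattern works out-of-core: because the model supports `partial_fit`, you can read the PyTables array chunk by chunk and feed each chunk to the estimator without ever loading the full dataset into RAM.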