python python-3.x scikit-learn cluster-analysis

Dealing with Memory Error (Python sklearn clustering)

I have a dataset in which each datum has a sparse set of labels. The data looks like this:

[["Snow","Winter","Freezing","Fun","Beanie","Footwear","Headgear","Fur","Playing in the snow","Photography"],["Tree","Sky","Daytime","Urban area","Branch","Metropolitan area","Winter","Town","City","Street light"],...]

There are around 50 distinct labels in total and about 200K data points. I want to cluster this data, but I'm having trouble doing so.

I want to try four clustering algorithms (AgglomerativeClustering, SpectralClustering, MiniBatchKMeans, KMeans), but none of them worked because of memory issues.

Below is my code.

from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
import json

NUM_OF_CLUSTERS = 10

with open('./data/sample.json') as json_file:
    json_data = json.load(json_file)
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in json_data:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

X = csr_matrix((data, indices, indptr), dtype=int).toarray()

# None of these algorithms work properly. I think it's because of memory issues.
# miniBatchKMeans = MiniBatchKMeans(n_clusters=NUM_OF_CLUSTERS, n_init=5, random_state=0).fit(X)
# agglomerative = AgglomerativeClustering(n_clusters=NUM_OF_CLUSTERS).fit(X)
# spectral = SpectralClustering(n_clusters=NUM_OF_CLUSTERS, assign_labels="discretize", random_state=0).fit(X)
#
# print(miniBatchKMeans.labels_)
# print(agglomerative.labels_)
# print(spectral.labels_)
with open('data.json', 'w') as outfile:
    json.dump(miniBatchKMeans.labels_.tolist(), outfile)

Are there any solutions or other recommendations for my problem?

Solution

What is the size of X?

With toarray() you are converting the data into a dense format. That significantly increases the memory requirements, because every zero entry is now stored explicitly.
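To see the difference for yourself, here is a minimal sketch that compares the bytes held by a CSR matrix with the bytes of its dense copy. The data is a random toy stand-in, not your dataset, and the sizes (1,000 rows, 500 possible labels, 10 labels per row) and the helper name csr_nbytes are assumptions made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Toy stand-in: each row gets `per_row` distinct labels out of `n_labels`.
rng = np.random.default_rng(0)
n_rows, n_labels, per_row = 1_000, 500, 10
indices = np.concatenate(
    [rng.choice(n_labels, size=per_row, replace=False) for _ in range(n_rows)]
)
indptr = np.arange(0, n_rows * per_row + 1, per_row)
data = np.ones(n_rows * per_row, dtype=np.int8)

X_sparse = csr_matrix((data, indices, indptr), shape=(n_rows, n_labels))

sparse_bytes = csr_nbytes(X_sparse)
dense_bytes = X_sparse.toarray().nbytes  # stores all 500,000 cells, zeros included

print("sparse:", sparse_bytes, "bytes; dense:", dense_bytes, "bytes")
```

The gap grows with the number of columns, so printing X.shape and X.nnz before calling toarray() on the real data is a cheap first check.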

With 200k instances you cannot use spectral clustering or affinity propagation, because these need O(n²) memory: a dense 200,000 × 200,000 matrix of 8-byte floats alone takes 320 GB. So either you choose other algorithms or you subsample your data. Obviously there is also no use in doing both kmeans and minibatch kmeans (which is an approximation to kmeans). Use only one.
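The arithmetic behind the O(n²) point, plus one way to draw a subsample, can be sketched as follows. The sample size of 10,000 is an arbitrary assumption, and pairwise_matrix_bytes is a helper invented here, not a library function.

```python
import numpy as np


def pairwise_matrix_bytes(n, itemsize=8):
    """Bytes needed for a dense n x n matrix with `itemsize`-byte entries."""
    return n * n * itemsize


full_bytes = pairwise_matrix_bytes(200_000)   # 320_000_000_000 bytes (320 GB)
sample_bytes = pairwise_matrix_bytes(10_000)  # 800_000_000 bytes (0.8 GB)

# Pick 10,000 distinct row indices at random (assumed sample size).
rng = np.random.default_rng(0)
sample_idx = np.sort(rng.choice(200_000, size=10_000, replace=False))

# Row selection works the same way on a NumPy array or a SciPy CSR matrix:
# X_sample = X[sample_idx]
```

You would then fit the quadratic-memory algorithm on X_sample only. How to assign the remaining points to clusters afterwards is a separate decision that this sketch does not cover.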

To efficiently work with sparse data, you may need to implement the algorithms yourself. Kmeans is designed for dense data, so it makes sense to tune the implementation for dense data by default. In fact, using the mean on sparse data is rather questionable, so I'd not expect the results to be very good on your data with kmeans either.
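Before writing your own implementation, it may be worth simply dropping the toarray() call: as far as I know, scikit-learn documents MiniBatchKMeans (and KMeans) as accepting a SciPy sparse matrix for X. Below is a minimal sketch under that assumption, using a tiny made-up list of label rows in place of your JSON file. It only addresses memory; whether the resulting clusters are any good is still subject to the caveat above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

# Made-up stand-in for json_data: 4 distinct label lists repeated 50 times.
rows = [
    ["Snow", "Winter", "Fun"],
    ["Tree", "Sky", "Winter"],
    ["Snow", "Freezing", "Fun"],
    ["City", "Sky", "Town"],
] * 50

# Same CSR construction as in the question.
indptr, indices, data, vocabulary = [0], [], [], {}
for labels in rows:
    for term in labels:
        indices.append(vocabulary.setdefault(term, len(vocabulary)))
        data.append(1.0)
    indptr.append(len(indices))

# Keep the matrix sparse: no .toarray() here.
X = csr_matrix(
    (data, indices, indptr),
    shape=(len(rows), len(vocabulary)),
    dtype=np.float32,
)

# n_clusters=2 and batch_size=64 are arbitrary choices for the toy data.
model = MiniBatchKMeans(n_clusters=2, n_init=5, batch_size=64, random_state=0).fit(X)
labels_out = model.labels_
print(X.shape, labels_out[:8])
```

Note that the cluster centers are dense regardless, so memory for them scales with n_clusters times the vocabulary size, which is small here.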