python scikit-learn cluster-analysis sparse-matrix dbscan

Input matrix and parameters for the DBSCAN algorithm from scikit-learn


I'm new to scikit-learn and I'm trying to cluster people by their interest in movies. I create a sparse matrix with one column per movie and one row per user. Each cell is 1 if the user liked the movie and 0 otherwise.

import pickle
import numpy

# Excerpt from a function body; list_user and list_movie are defined earlier.
sparse_matrix = numpy.zeros(shape=(len(list_user), len(list_movie)))
for index_id in range(len(list_user)):
    for movie in list_movie[index_id]:
        if movie.isdigit():
            index_movie = list_movie.index(int(movie))
            sparse_matrix[index_id, index_movie] = 1  # user liked this movie
with open("data/sparse_matrix", "wb") as f:  # pickle needs binary mode
    pickle.dump(sparse_matrix, f)
return sparse_matrix

I treat this as an array of vectors, which according to the documentation is an acceptable input:

Perform DBSCAN clustering from vector array or distance matrix.

Link to the citation

So I tried the following with scikit-learn:

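import pickle
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

with open("data/sparse_matrix", "rb") as f:  # binary mode, matching the dump
    sparse_matrix = pickle.load(f)
X = StandardScaler().fit_transform(sparse_matrix)
db = DBSCAN(eps=1, min_samples=20).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
print(labels)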

I based this on the DBSCAN example from scikit-learn. I have two questions. First: is my matrix well formatted and suitable for this algorithm? I'm concerned about the number of dimensions. Second: how should I set the epsilon parameter (the maximum distance between two points for them to count as neighbors)?


Solution

  • See the DBSCAN article for a suggestion on how to choose epsilon based on the k-distance graph.

    Since your data is sparse, cosine distance is probably more appropriate than Euclidean distance. You should also use a sparse format. As far as I know, numpy.zeros creates a dense matrix:

     sparse_matrix = numpy.zeros(...)
    

    is therefore misleading, because it is a dense matrix, just with mostly 0s.
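
To make the suggestions above concrete, here is a minimal sketch, not a definitive implementation. It assumes the likes are available as (user_index, movie_index) pairs. The toy data, the helper names build_likes_matrix and k_distances, and the values eps=0.3 and min_samples=3 are all made up for illustration. It stores the matrix in SciPy's CSR format, computes the sorted k-distance values with cosine distance, and then runs DBSCAN with metric="cosine".

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def build_likes_matrix(likes, n_users, n_movies):
    """Build a sparse user x movie 0/1 matrix from (user, movie) index pairs."""
    pairs = sorted(set(likes))  # drop duplicates so every stored value stays 1
    rows = [u for u, _ in pairs]
    cols = [m for _, m in pairs]
    data = np.ones(len(pairs), dtype=np.float64)
    return csr_matrix((data, (rows, cols)), shape=(n_users, n_movies))


def k_distances(X, k, metric="cosine"):
    """Return the sorted distance from each row to its k-th nearest neighbour."""
    nn = NearestNeighbors(n_neighbors=k, metric=metric, algorithm="brute").fit(X)
    # Calling kneighbors() without arguments excludes each point from its own
    # neighbour list, so the last column is the k-th *other* point.
    dist, _ = nn.kneighbors()
    return np.sort(dist[:, -1])


# Hypothetical toy data: users 0-2 like movies 0 and 1, users 3-5 like
# movies 2 and 3, and user 6 likes only movie 4.
likes = [(u, m) for u in (0, 1, 2) for m in (0, 1)]
likes += [(u, m) for u in (3, 4, 5) for m in (2, 3)]
likes += [(6, 4)]
X = build_likes_matrix(likes, n_users=7, n_movies=5)

# Sorted k-distances; on real data, plot these and look for the "knee"
# to pick a candidate eps.
kd = k_distances(X, k=2)

# eps and min_samples are illustrative values chosen for this toy data only.
db = DBSCAN(eps=0.3, min_samples=3, metric="cosine").fit(X)
labels = db.labels_
print(kd)
print(labels)
```

On real data you would plot the k-distance values (for example with matplotlib) and read eps off the bend in the curve. Note that StandardScaler is left out here on purpose: centering the columns would turn most zeros into nonzero values and defeat the sparse format.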