python machine-learning scikit-learn cluster-analysis dbscan

Clustering strings with DBSCAN

Summary: I am looking for a Python implementation of DBSCAN that clusters the rows of a multi-column CSV file based on the 'Contents' column.

Input:

    Sample rows of the input CSV file:

    Rank, Domain, Contents      

    1, abc.com, hello random text out
    2, xyz.com, hello random somethingelse
    3, not.com, a b c d
    4, plus.com, a b asdsadsa asdsadasdsadsa
    5, minus.com, man win 

   Where:

   Column 1 => Rank = integer
   Column 2 => Domain = domain name, e.g. abc.com
   Column 3 => Contents = list of words (a string of cleaned-up words extracted from the HTML page)

Output:

    The clusters should be based on the similarity of the contents:

    Cluster 1: abc.com, xyz.com
    Cluster 2: not.com, plus.com
    Cluster 3: minus.com
    ....

    Please note: in the output, I am not looking for the words that fall in the same cluster. Instead, I am looking for the domain names (column 2), clustered by the similarity of column 3, 'Contents'.

I looked at the following resources, but they are based on k-means and do not produce the DBSCAN cluster output I am looking for. Please note that specifying the number of clusters is not applicable here, since we do not want to fix the number of clusters in advance.

1) How can I cluster text data with multiple columns?

2) Clustering text documents using scikit-learn kmeans in Python

3) http://brandonrose.org/clustering

4) https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

5) https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52

So:

input <= CSV file with 'Rank', 'Domain', 'Contents'
output <= clusters of domain names [NOT contents]

A Python implementation using DBSCAN clustering would be ideal.

Thanks!

Solution

First, select the "Contents" column of your dataset, and keep the "Domain" column alongside it so you can map the results back later. Python's csv module is enough for this step.
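A minimal sketch of this reading step, assuming the header is exactly `Rank, Domain, Contents` and that the contents never contain a comma (otherwise the field would need quoting). The function name `read_domains_and_contents` is made up for illustration, and the sample rows from the question are read from an in-memory string here instead of a real file:

```python
import csv
import io

# Sample rows from the question; in practice you would open your CSV file instead.
SAMPLE_CSV = """Rank, Domain, Contents
1, abc.com, hello random text out
2, xyz.com, hello random somethingelse
3, not.com, a b c d
4, plus.com, a b asdsadsa asdsadasdsadsa
5, minus.com, man win
"""


def read_domains_and_contents(handle):
    """Return two parallel lists: the domains and their contents strings."""
    # skipinitialspace=True drops the blank that follows each comma,
    # so the header fields become "Rank", "Domain" and "Contents".
    reader = csv.DictReader(handle, skipinitialspace=True)
    domains, contents = [], []
    for row in reader:
        domains.append(row["Domain"].strip())
        contents.append(row["Contents"].strip())
    return domains, contents


domains, contents = read_domains_and_contents(io.StringIO(SAMPLE_CSV))
print(domains)
print(contents)
```

With a real file, something like `open("input.csv", newline="")` (the file name is a placeholder) can be passed in place of the `io.StringIO` object.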

Next, transform the texts into vectors that DBSCAN can be fitted on. The second link you gave has everything you need for that step.
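One common way to do the vectorization is TF-IDF with scikit-learn's `TfidfVectorizer`. That choice is an assumption on my part, and other text representations would work too. One thing to watch out for with your sample data: the default `token_pattern` only keeps tokens of two or more characters, so a row like `a b c d` would end up as an all-zero vector. The sketch below relaxes the pattern to keep single-character words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The "Contents" column of the sample rows from the question.
contents = [
    "hello random text out",
    "hello random somethingelse",
    "a b c d",
    "a b asdsadsa asdsadasdsadsa",
    "man win",
]

# Keep one-character tokens too; the default pattern would drop "a", "b", ...
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

# X is a sparse matrix with one row per document and one column per distinct word.
X = vectorizer.fit_transform(contents)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of `X` is L2-normalised by default, which goes well with a cosine distance in the clustering step.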

Then fit DBSCAN on the vectors. You can use the implementation in scikit-learn, for instance.
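A rough sketch of the fitting step with `sklearn.cluster.DBSCAN`, reusing TF-IDF vectors as one possible representation. The values `eps=0.7` and `min_samples=1` are assumptions picked by hand for this five-row toy sample and will almost certainly need tuning on real data. With `min_samples=1` every row is a core point, so a row with no close neighbour becomes its own cluster; with a larger `min_samples`, such rows get the label `-1` (noise) instead.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

contents = [
    "hello random text out",
    "hello random somethingelse",
    "a b c d",
    "a b asdsadsa asdsadasdsadsa",
    "man win",
]

# Vectorize the texts (keeping one-character words, see the note on token_pattern).
X = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(contents)

# No number of clusters is given: DBSCAN derives it from eps and min_samples.
# eps is the maximum cosine distance between two rows to count as neighbours.
dbscan = DBSCAN(eps=0.7, min_samples=1, metric="cosine")

# labels[i] is the cluster id of row i.
labels = dbscan.fit_predict(X)

print(labels)
```

Note that no cluster count appears anywhere, which is what you asked for. The price is that `eps` now controls how coarse or fine the clusters are.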

Once you have the labels associated with the vectors (i.e. with the lines of the CSV file), you can group the line numbers by cluster and retrieve the corresponding domains.
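The grouping itself only needs the standard library. In this sketch the `labels` list is hard-coded with illustrative values in the shape DBSCAN returns (one integer per row, `-1` meaning noise), and `group_domains_by_label` is a hypothetical helper name:

```python
from collections import defaultdict


def group_domains_by_label(domains, labels):
    """Map each cluster label to the list of domains carrying that label."""
    clusters = defaultdict(list)
    for domain, label in zip(domains, labels):
        # int() also accepts the NumPy integers that scikit-learn returns.
        clusters[int(label)].append(domain)
    return dict(clusters)


domains = ["abc.com", "xyz.com", "not.com", "plus.com", "minus.com"]
labels = [0, 0, 1, 1, -1]  # illustrative values, not computed here

clusters = group_domains_by_label(domains, labels)

for label in sorted(clusters):
    # DBSCAN uses -1 for rows that do not belong to any cluster.
    name = "Noise" if label == -1 else f"Cluster {label + 1}"
    print(f"{name}: {', '.join(clusters[label])}")
```

Since the domains and the vectors are built in the same row order, `zip` is enough to map each label back to its domain.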