Summary: Looking for DBSCAN implementation of python code in clustering the multiple column csv file based on the column 'contents'
Input:
input csv file rows sample
Rank, Domain, Contents
1, abc.com, hello random text out
2, xyz.com, hello random somethingelse
3, not.com, a b c d
4, plus.com, a b asdsadsa asdsadasdsadsa
5, minus.com, man win
Where,
Column 1 => Rank = digit
Column 2 => Domain = domain name ex. abc.com
Column 3 => Contents = list of words (string, this is
extracted clean up words from html page)
Output :
The output of the cluster be based on similar list of contents
Cluster 1: abc.com, xyz.com
Cluster 2: not.com, plus.com
Cluster 3: minus.com
....
Please note: In output, I am not looking for words that are in same cluster. Instead, I am looking for a 'domain name', column which is clustered based on similar contents of column 3, 'contents'
I researched following resources but they are based on kmeans and does not relate to the DBSCAN cluster output that I am looking for. Please note, providing cluster number will not be applicable in this case as we do not want to limit the cluster number based on the input.
1) How can I cluster text data with multiple columns?
2) Clustering text documents using scikit-learn kmeans in Python
3) http://brandonrose.org/clustering
4) https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
so,
input <= csv file with 'Rank', 'Domain', 'Contents'
output <= cluster with domain name [NOT contents]
A python implementation in DBSCAN clustering would be an ideal.
Thanks!
You first need to select the "Contents" column of your dataset. You can use the csv
module of Python for that step.
Then you have to transform the texts into vectors on which DBSCAN can be trained. The second link you gave have everything you need to do that step.
Then you have to train DBSCAN on the vectors. You can use the implementation of DBSCAN in scikit-learn for instance.
Once you have the labels associated to the vectors (i.e. the lines of the csv file), you can group the number of lines by cluster and retrieve the domains.