Tags: machine-learning, cluster-analysis, mahout

How to take key and value in a CSV file for Kmeans clustering in Mahout


I am trying to run K-means clustering on the following data set:

Name,Gender,Age,Drinks,Country
John,M,30,Pepsi,US
Jack,M,25,Coke,US
David,M,34,Pepsi,UK
Ted,M,37,Limca,CAN
Robert,M,23,Limca,US
Adrian,M,31,Pepsi,US
Craig,M,37,Coke,UK
Katie,F,23,Limca,UK
Nancy,F,32,Pepsi,UK

I want to cluster the data based on Drinks (Pepsi, Coke, Limca), and I am able to do that. But I also want to retrieve the name alongside the clustered data.

The output I am getting is:

0
1
2 
Limca belongs to cluster:0
Coke belongs to cluster:0
etc.

Here I want to get the names as well.

While converting to a sequence file, I take the drink as the key and the rest of the text as the value, convert it to a sparse vector, and then run K-means clustering. The names are not printed. Can anybody point out how I can extract the names, which are in the values, from the clusters?


Solution

  • You may need to map the drink names to numeric codes, e.g. {Pepsi, Coke, Pepsi, Limca} to {1001, 1002, 1001, 1003}, and then map the codes back to the original values after clustering.

    But as mentioned in one of the answers, simply grouping by drink is not really a clustering job; it is closer to an SQL GROUP BY query. If your problem is more complex than grouping, you can try the approach described in the paragraph above.
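
The encode/decode idea above can be sketched in plain Java. This is only an illustration under stated assumptions, not Mahout code: the class `DrinkEncoding`, its methods, and the starting code 1001 are hypothetical names chosen to mirror the example in the answer. It assigns each distinct drink a numeric code, keeps the name next to each coded record, and maps the codes back to drink names at the end. In a real Mahout job the name would still have to travel with each vector (for example as the sequence-file key) for it to show up in the clustered output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DrinkEncoding {
    // Assumption: codes start at 1001, as in the answer's example.
    public static final int BASE_CODE = 1001;

    // Sample data copied from the question.
    public static final String SAMPLE_CSV = String.join("\n",
            "Name,Gender,Age,Drinks,Country",
            "John,M,30,Pepsi,US", "Jack,M,25,Coke,US", "David,M,34,Pepsi,UK",
            "Ted,M,37,Limca,CAN", "Robert,M,23,Limca,US", "Adrian,M,31,Pepsi,US",
            "Craig,M,37,Coke,UK", "Katie,F,23,Limca,UK", "Nancy,F,32,Pepsi,UK");

    /** Splits the CSV text into rows of fields, skipping the header line. */
    public static List<String[]> parseCsv(String csv) {
        List<String[]> rows = new ArrayList<>();
        String[] lines = csv.split("\n");
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].trim().isEmpty()) {
                rows.add(lines[i].trim().split(","));
            }
        }
        return rows;
    }

    /** Assigns each distinct drink (column 3) a code in order of first appearance. */
    public static Map<String, Integer> buildCodes(List<String[]> rows) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        for (String[] row : rows) {
            String drink = row[3];
            if (!codes.containsKey(drink)) {
                codes.put(drink, BASE_CODE + codes.size());
            }
        }
        return codes;
    }

    /** Groups names (column 0) by the numeric code of their drink. */
    public static Map<Integer, List<String>> namesByCode(List<String[]> rows,
                                                         Map<String, Integer> codes) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(codes.get(row[3]), k -> new ArrayList<>()).add(row[0]);
        }
        return groups;
    }

    /** Maps a numeric code back to the original drink name. */
    public static String decode(Map<String, Integer> codes, int code) {
        for (Map.Entry<String, Integer> e : codes.entrySet()) {
            if (e.getValue() == code) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("Unknown code: " + code);
    }

    public static void main(String[] args) {
        List<String[]> rows = parseCsv(SAMPLE_CSV);
        Map<String, Integer> codes = buildCodes(rows);
        // Print each group with the decoded drink name and the member names.
        for (Map.Entry<Integer, List<String>> e : namesByCode(rows, codes).entrySet()) {
            System.out.println(decode(codes, e.getKey()) + " (" + e.getKey() + "): " + e.getValue());
        }
    }
}
```

Because the only feature here is the drink, this amounts to the plain grouping the answer mentions. The encoding step only becomes useful as input to clustering once further numeric features (such as Age) are added to each record.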