Points missing from clusteredPoints in the clustering result [mahout]


With clusterdump I got results in both CSV and TEXT format, as follows.

CSV:

0,Sports_38.txt
1,Sports_23.txt
2,Sports_36.txt
3,Sports_13.txt
4,Sports_31.txt,Sports_32.txt
5,Sports_28.txt,Sports_29.txt
6,Sports_2.txt
9,Sports_15.txt

TEXT:

{"identifier":"VL-1","r":[],"c":[...,"n":7}
Top Terms: 
    什                                       =>  15.829998016357422
    利物浦                                     =>  13.629814147949219
    克                                       =>  11.317766189575195
    格                                       =>  10.938775062561035
    特                                       =>  10.842317581176758
    尔                                       =>  10.447234153747559
    切尔西                                     =>   9.742402076721191
    比赛                                      =>   8.247735023498535
    表现                                      =>   7.909337520599365
    批评                                      =>   7.462332725524902

I noticed that VL-1 has only one point in the CSV file but 7 points in the TEXT file (VL-1's "n" equals 7).
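One way to check the mismatch systematically is to count how many point names the CSV output lists per cluster and compare each count with the "n" value in the TEXT output. The following is only a minimal sketch: it assumes each CSV line has the layout shown above (cluster id first, then the point names), and the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvClusterCount {
    // Counts the point names on each line of the assumed form
    // "clusterId,point1,point2,...". Blank lines are skipped.
    static Map<String, Integer> countPoints(List<String> csvLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : csvLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split(",");
            // fields[0] is the cluster id; the remaining fields are point names.
            counts.merge(fields[0], fields.length - 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few lines copied from the CSV output above.
        List<String> lines = List.of(
            "0,Sports_38.txt",
            "1,Sports_23.txt",
            "4,Sports_31.txt,Sports_32.txt");
        System.out.println(countPoints(lines)); // prints {0=1, 1=1, 4=2}
    }
}
```

A cluster whose count here is lower than its "n" has points that were used to build the cluster but were not written out with an assignment.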

Why did some points disappear? And how can I get the cluster of every point?

Thanks a lot.


Solution

  • I also got an empty clusteredPoints output when the data set was slightly larger.

    I eventually found the reason myself.

    clusterClassificationThreshold, the 8th parameter of KMeansDriver.run (Mahout 1.0), should be set to 0 so that no point is filtered out of clusteredPoints.

    See this mailing-list thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700@windwardsolutions.com%3E
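To see why a non-zero threshold can make points vanish, here is a small self-contained sketch of threshold-based filtering. It is not Mahout's code. It assumes, for illustration only, that each point has a membership score for its best cluster and that a point is written out only when that score reaches the threshold. The class name, method name, point names, and scores are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ThresholdFilterSketch {
    // Returns the points whose best-cluster score is at least the threshold.
    // Points below the threshold are dropped, i.e. they get no cluster assignment.
    static List<String> assignedPoints(Map<String, Double> bestScoreByPoint, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> entry : bestScoreByPoint.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical scores, not taken from any real Mahout run.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("doc_a", 0.9);
        scores.put("doc_b", 0.3);
        scores.put("doc_c", 0.05);

        System.out.println(assignedPoints(scores, 0.5)); // prints [doc_a]
        System.out.println(assignedPoints(scores, 0.0)); // prints [doc_a, doc_b, doc_c]
    }
}
```

With a threshold of 0 every non-negative score passes, which matches the advice above: all points keep their assignment.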