I am using Mahout for text clustering.
My hardware and software are listed below.
Server:
CPU: Intel Xeon E5-2620 2GHz, RAM: 64GB
Software:
ubuntu-12.4.1 on VirtualBox
hadoop-1.0.4, mahout-0.7
I use the canopy algorithm to cluster 80,000 text files. It runs for a very long time: it needs two or three weeks to finish, yet CPU utilization stays below 20%.
I found that someone else has the same problem: http://mail-archives.apache.org/mod_mbox/mahout-user/201212.mbox/%3C7959565186420075099@unknownmsgid%3E#archives
However, I still don't know how to speed it up. Is there some parameter I have missed? Or is the server not powerful enough to run this job?
Hadoop and Mahout are meant for clusters of multiple computers. On a single host, software optimized for single-machine operation will likely be much faster.
Hadoop (and Mahout) are built to manage data that is too large to fit into a single computer's memory. This requires the data to be stored in files and transmitted over the network to other hosts.
If you take this approach, repeatedly writing interim results to disk, without actually needing to, you will of course be slower than if you did everything in memory.
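To make the in-memory alternative concrete, here is a minimal sketch of canopy clustering in plain Python. It is not Mahout's implementation: the names tf_vector, cosine_distance and canopy, the whitespace tokenization, and the threshold values in the usage note are all assumptions chosen for illustration. It only shows that the algorithm itself needs nothing more than a list of vectors held in RAM.

```python
import math
from collections import Counter


def tf_vector(text):
    """Turn a document into a sparse term-frequency vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty documents as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def canopy(vectors, t1, t2, distance=cosine_distance):
    """Canopy clustering with loose threshold t1 and tight threshold t2.

    Returns a list of (center_index, member_indices) pairs.
    A point may belong to several canopies.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(range(len(vectors)))
    canopies = []
    while remaining:
        center = remaining[0]
        members = [center]
        still_remaining = []
        for i in remaining[1:]:
            d = distance(vectors[center], vectors[i])
            if d < t1:
                members.append(i)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(i)  # not tightly close: may seed or join others
        canopies.append((center, members))
        remaining = still_remaining
    return canopies
```

As a usage example, canopy([tf_vector(d) for d in docs], t1=0.8, t2=0.5) on a list of strings docs returns the canopy centers and their members by index. Everything stays in memory, so no interim results are written to disk between passes.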
As your CPU is not fully used, the bottleneck must be somewhere else. Have a look at your disk I/O; it is probably your limiting factor.
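One rough way to check this is to measure how much CPU time is spent waiting on I/O. The sketch below assumes a Linux system that exposes /proc/stat, where the fifth counter on the aggregate "cpu" line is iowait. The helper names parse_cpu_line, iowait_share and measure are hypothetical. Tools such as iostat or vmstat report similar figures.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters (in jiffies) from /proc/stat content."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user/nice, so it is left out.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total


def measure(interval_seconds=5, path="/proc/stat"):
    """Sample /proc/stat twice and return the iowait share in between."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_share(before, after)
```

Calling measure(5) while the clustering job runs gives the iowait fraction over five seconds. A consistently high value alongside low CPU utilization would point toward the disk as the limiting factor.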