hadoop, cpu, cluster-analysis, mahout

Mahout CPU utilization in clustering


I use Mahout to do text clustering.

My hardware and software are listed below.

server:
CPU: Intel Xeon E5-2620 2GHz, RAM: 64GB

software:
ubuntu-12.4.1 on VirtualBox
hadoop-1.0.4, mahout-0.7

I use the canopy algorithm to cluster 80,000 text files. The job runs for a very long time, needing two or three weeks to finish, yet CPU utilization stays just below 20%.

Someone else has reported the same problem: http://mail-archives.apache.org/mod_mbox/mahout-user/201212.mbox/%3C7959565186420075099@unknownmsgid%3E#archives

But I still don't know how to speed it up. Is there some parameter setting I have missed? Or is the server not powerful enough to run this job?


Solution

  • Hadoop and Mahout are meant for clusters of multiple computers. On a single host, software optimized for single-machine operation will likely be much faster.

    Hadoop (and Mahout) manage data that is too large to fit into a single computer's memory. This requires the data to be stored in files and transmitted over the network to other hosts.

    If you take this approach, repeatedly writing interim results to disk, without needing to, you will of course be slower than if you did everything in memory.

    As your CPU is not fully used, there must be a bottleneck somewhere else. Have a look at your disk I/O; this is probably your limiting factor at the moment.
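
One rough way to check the disk I/O suspicion on a Linux guest is to sample the aggregate `cpu` line of `/proc/stat` twice and see how much of the elapsed CPU time was spent in iowait. The following is a minimal sketch, not a definitive diagnostic: the function names (`parse_cpu_line`, `cpu_shares`, `sample`) and the 5-second interval are assumptions made for illustration, and iowait measured inside a VirtualBox guest may not fully reflect what the host's disk is doing.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before, after):
    """Fractions of elapsed CPU time spent busy, idle, and waiting on I/O."""
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user, so only the first 8 are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("no CPU time elapsed between samples")
    idle = deltas[3] / total
    iowait = deltas[4] / total if len(deltas) > 4 else 0.0
    return {"busy": 1.0 - idle - iowait, "idle": idle, "iowait": iowait}


def sample(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_shares(before, after)
```

Calling `sample()` while the clustering job is running returns a dictionary of fractions. A high `iowait` share next to a low `busy` share would be consistent with the disk being the bottleneck, while a high `idle` share with little iowait would point elsewhere, for example to a job that simply is not running in parallel.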