Tags: hadoop, mahout, k-means

How to perform K-means using Apache Hadoop?


I'm a beginner with Apache Hadoop, and so far I have solved the word count problem using MapReduce for learning purposes. My objective is to perform K-means clustering on a data set of, say, 1.5 GB or more.

What is the simplest way to perform K-means clustering using Hadoop? Should I modify my map and reduce functions according to the requirements of K-means, do I need Mahout (which I haven't used before), or can the objective be achieved without it?

The host OS is Windows 7, and I have set up the Hortonworks Sandbox 2.3 on VirtualBox. Any help would be much appreciated, as I'm a bit confused about which path to choose to achieve my objective. Thanking you in anticipation.


Solution

  • I think the easiest way to do k-means is with Spark MLlib's KMeans. Spark runs on top of Hadoop HDFS.

    Apache Spark

    Here is an example; more details can be found on the Spark site:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    
    public class KMeansExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("K-means Example");
        JavaSparkContext sc = new JavaSparkContext(conf);
    
        // Load and parse data
        String path = "data/mllib/kmeans_data.txt";
        JavaRDD<String> data = sc.textFile(path);
        JavaRDD<Vector> parsedData = data.map(
          new Function<String, Vector>() {
            public Vector call(String s) {
              String[] sarray = s.split(" ");
              double[] values = new double[sarray.length];
              for (int i = 0; i < sarray.length; i++)
                values[i] = Double.parseDouble(sarray[i]);
              return Vectors.dense(values);
            }
          }
        );
        parsedData.cache();
    
        // Cluster the data into two classes using KMeans
        int numClusters = 2;
        int numIterations = 20;
        KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
    
        // Evaluate clustering by computing Within Set Sum of Squared Errors
        double WSSSE = clusters.computeCost(parsedData.rdd());
        System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
    
        // Save and load model
        clusters.save(sc.sc(), "myModelPath");
        KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
      }
    }
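
  • If you would rather implement K-means with your own map and reduce functions, as the question suggests, note that each iteration maps cleanly onto the paradigm: the map step assigns every point to its nearest centroid, and the reduce step averages the points assigned to each centroid to produce the new centroids. Below is a minimal single-machine sketch of one such iteration (plain Java, no Hadoop dependencies; the class name and sample data are made up for illustration):

    ```java
    import java.util.Arrays;

    // Sketch of one k-means iteration, mirroring the roles of the
    // mapper (nearest-centroid assignment) and reducer (centroid update).
    public class KMeansIteration {

        // "Map" step: index of the centroid closest to a point
        // (squared Euclidean distance).
        static int nearest(double[] point, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int i = 0; i < point.length; i++) {
                    double diff = point[i] - centroids[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best;
        }

        // "Reduce" step: each new centroid is the mean of the points
        // assigned to it; empty clusters keep their old centroid.
        static double[][] recompute(double[][] points, double[][] centroids) {
            int k = centroids.length, dim = points[0].length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < dim; i++) sums[c][i] += p[i];
            }
            for (int c = 0; c < k; c++)
                for (int i = 0; i < dim; i++)
                    sums[c][i] = counts[c] == 0 ? centroids[c][i]
                                                : sums[c][i] / counts[c];
            return sums;
        }

        public static void main(String[] args) {
            double[][] points = {{0, 0}, {0, 1}, {9, 9}, {9, 10}};
            double[][] centroids = {{0, 0}, {9, 9}};
            // One iteration moves each centroid to its cluster's mean.
            System.out.println(Arrays.deepToString(recompute(points, centroids)));
            // prints [[0.0, 0.5], [9.0, 9.5]]
        }
    }
    ```

    In a real Hadoop job you would repeat this until the centroids stop moving, broadcasting the current centroids to every mapper (for example via the distributed cache) on each pass. This is exactly the bookkeeping that Mahout and Spark MLlib handle for you, which is why they are usually the more practical choice at 1.5 GB and beyond.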