I'm a beginner to Apache Hadoop, and so far I have solved the Word Count problem using MapReduce for learning purposes. My objective is to perform K-means clustering on a data set of, say, 1.5 GB or more.
What is the simplest way to perform K-means clustering using Hadoop? Should I modify my map and reduce functions to suit K-means, do I need Mahout (I haven't used it before), or can the objective be achieved without it?
The host OS is Windows 7, and I have set up the Hortonworks Sandbox 2.3 on VirtualBox. Any help would be much appreciated, as I'm a bit confused about which path to choose to achieve my objective. Thanks in advance.
I think the easiest way to do K-means is to use the KMeans implementation in Apache Spark's MLlib. Spark can run on top of Hadoop and read its input from HDFS, so it fits your sandbox setup.
Here is an example; you can find more details on the Spark MLlib clustering documentation page.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("K-means Example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and parse the data (one space-separated vector per line)
    String path = "data/mllib/kmeans_data.txt";
    JavaRDD<String> data = sc.textFile(path);
    JavaRDD<Vector> parsedData = data.map(
      new Function<String, Vector>() {
        public Vector call(String s) {
          String[] sarray = s.split(" ");
          double[] values = new double[sarray.length];
          for (int i = 0; i < sarray.length; i++) {
            values[i] = Double.parseDouble(sarray[i]);
          }
          return Vectors.dense(values);
        }
      }
    );
    parsedData.cache();

    // Cluster the data into two classes using KMeans
    int numClusters = 2;
    int numIterations = 20;
    KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    double WSSSE = clusters.computeCost(parsedData.rdd());
    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);

    // Save and load the model
    clusters.save(sc.sc(), "myModelPath");
    KMeansModel sameModel = KMeansModel.load(sc.sc(), "myModelPath");
  }
}
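If you later want to reuse the saved model, a minimal sketch like the one below could load it back, print the cluster centres, and assign a new point to a cluster. The path "myModelPath" matches the one used above; the sample point is made up purely for illustration.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansPredictExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("K-means Predict Example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load the model that the training job saved earlier
    KMeansModel model = KMeansModel.load(sc.sc(), "myModelPath");

    // Print the learned cluster centres
    for (Vector center : model.clusterCenters()) {
      System.out.println("Cluster centre: " + center);
    }

    // Assign a new (made-up) point to its nearest cluster
    Vector point = Vectors.dense(0.1, 0.1, 0.1);
    System.out.println("Point " + point + " belongs to cluster " + model.predict(point));

    sc.stop();
  }
}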