Background: [Skip ahead to next section for exact problem]
I am currently working on Hadoop as a small project in my University (not a mandatory project, I am doing it because I want to).
My plan was to use 5 PCs in one of the labs (Master + 4 Slaves) to run a KNN algorithm on a large data set to find out the running time, etc.
I knew I could find the basic code on the internet, and I did find it ( https://github.com/matt-hicks/MapReduce-KNN ). It runs fine for a single test case, but what I have is a very large one with hundreds of test cases. Therefore, I needed to iterate the same bit of code for each test case.
The Problem:
tl;dr: I have a KNN program that only takes one test case at a time, but I want to make it iterative so that it can work with multiple test cases.
My Solution:
I am not very experienced with this, so with the basics I know, I decided to turn the variables and maps into arrays of variables and arrays of maps.
So this:
public static class KnnMapper extends Mapper<Object, Text, NullWritable, DoubleString>
{
    DoubleString distanceAndModel = new DoubleString();
    TreeMap<Double, String> KnnMap = new TreeMap<Double, String>();

    // Declaring some variables which will be used throughout the mapper
    int K;

    double normalisedSAge;
    double normalisedSIncome;
    String sStatus;
    String sGender;
    double normalisedSChildren;
became this:
    DoubleString distanceAndModel = new DoubleString();
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];

    // Declaring some variables which will be used throughout the mapper
    int[] K = new int[1000];

    double[] normalisedSAge = new double[1000];
    double[] normalisedSIncome = new double[1000];
    String[] sStatus = new String[1000];
    String[] sGender = new String[1000];
    double[] normalisedSChildren = new double[1000];

    int n = 0;
And this:
protected void setup(Context context) throws IOException, InterruptedException
{
    if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
    {
        // Read parameter file using alias established in main()
        String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));
        StringTokenizer st = new StringTokenizer(knnParams, ",");

        // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
        // These values will remain unchanged throughout the mapper
        K = Integer.parseInt(st.nextToken());
        normalisedSAge = normalisedDouble(st.nextToken(), minAge, maxAge);
        normalisedSIncome = normalisedDouble(st.nextToken(), minIncome, maxIncome);
        sStatus = st.nextToken();
        sGender = st.nextToken();
        normalisedSChildren = normalisedDouble(st.nextToken(), minChildren, maxChildren);
    }
}
became this:
protected void setup(Context context) throws IOException, InterruptedException
{
    if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
    {
        // Read parameter file using alias established in main()
        String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));

        // Split the input file on newline or carriage-return characters,
        // i.e. one test case per line (handles Windows line endings too)
        StringTokenizer lineSt = new StringTokenizer(knnParams, "\n\r");

        // Loop to tokenize each line of input, i.e. each test case
        while (lineSt.hasMoreTokens()) {
            String nextLine = lineSt.nextToken(); // Current line as a string
            StringTokenizer st = new StringTokenizer(nextLine, ","); // Tokenize the current line into its fields

            // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
            // These values will remain unchanged throughout the mapper
            K[n] = Integer.parseInt(st.nextToken());
            normalisedSAge[n] = normalisedDouble(st.nextToken(), minAge, maxAge);
            normalisedSIncome[n] = normalisedDouble(st.nextToken(), minIncome, maxIncome);
            sStatus[n] = st.nextToken();
            sGender[n] = st.nextToken();
            normalisedSChildren[n] = normalisedDouble(st.nextToken(), minChildren, maxChildren);
            n++;
        }
    }
}
And so on for the reducer class as well.
This is the first time I am working with TreeMaps, though. I've studied and used trees before, but not Maps or TreeMaps. I still tried to make it an array, which turned out to be wrong:
/home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                       ^
/home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double
        normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);
                             ^
/home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                       ^
/home/hduser/Desktop/knn/KnnPattern.java:283: error: bad operand types for binary operator '>'
    if (KnnMap[num].size() > K)
                           ^
  first type:  int
  second type: int[]
Now I thought that maybe a linked list of TreeMaps could work.
But I have mostly worked with C/C++ and Python at university so far. OOP seems to make life easier for people here, but I am not 100% sure how to use it.
My question:
Is it possible to make a Linked List of TreeMaps?
Is there a Linked List substitute for:
TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
And is my approach to the problem correct? Making the code iterative should let it handle all the test cases, right?
I will try, by trial and error, to make it work from there, but this is something I have been stuck on for a few days now.
My apologies if someone has already asked this before, but I couldn't find anything, so I had to write a question. Please share the link to any related answer if you think this has already been answered.
Thank you! And on a side note: is there anything else I should keep in mind when working with TreeMaps, and specifically a linked list of TreeMaps?
Regarding the error messages
/home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                       ^
and
/home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
                                       ^
These errors occur because you tried to create an array with a generic component type, which Java does not support because generic type information is erased at runtime. A workaround (if you really need an array) would be to create a List of TreeMap objects and then convert it to an array:
// TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000]; // does not compile
List<TreeMap<Double, String>> KnnMapList = new LinkedList<>();
// Note: a plain toArray() returns Object[], which cannot be cast to a TreeMap array
// at runtime, so pass a typed array instead; the conversion is unchecked but safe here.
@SuppressWarnings("unchecked")
TreeMap<Double, String>[] KnnMap = KnnMapList.toArray(new TreeMap[KnnMapList.size()]);
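To answer the linked-list question directly: yes, you can put TreeMaps in a LinkedList (or any other List), and in most cases that removes the need for an array altogether. A minimal, self-contained sketch (class and variable names are mine, not from the original code):

```java
import java.util.LinkedList;
import java.util.List;
import java.util.TreeMap;

public class TreeMapListDemo {
    public static void main(String[] args) {
        // A List of TreeMaps sidesteps generic array creation entirely
        List<TreeMap<Double, String>> knnMaps = new LinkedList<>();
        for (int i = 0; i < 3; i++) {
            knnMaps.add(new TreeMap<Double, String>());
        }

        // Each element is an ordinary TreeMap: distance -> model
        knnMaps.get(0).put(0.42, "modelA");
        knnMaps.get(0).put(0.17, "modelB");

        // TreeMap keeps its keys sorted, so firstKey() is the smallest distance
        System.out.println(knnMaps.get(0).firstKey()); // prints 0.17
        System.out.println(knnMaps.size());            // prints 3
    }
}
```

One caveat: `LinkedList.get(i)` walks the list and is O(i), so if you index by position a lot (as the `[n]` array accesses do), `ArrayList` is the better List implementation.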
See this question for further information.
/home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double
        normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);
                             ^
By looking at the source code on GitHub, I realized that you probably did not modify the following method call in the method KnnMapper#map(Object, Text, Context):
double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
        normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);

should be

double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
        normalisedRChildren, normalisedSAge[n], normalisedSIncome[n], sStatus[n], sGender[n], normalisedSChildren[n]);
But I guess these modifications will not give you the desired behaviour, because KnnMapper#map(Object, Text, Context) is only called once per key/value pair, as stated here, and you probably want to call it n times.
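To make the per-record comparison concrete: if you did keep the arrays from the question, the loop over the cached test cases would have to run inside map() itself, once per input record. Below is a plain-Java sketch of that inner loop, with no Hadoop dependencies, made-up numbers, and only two of the five attributes for brevity:

```java
// Sketch only: shows the shape of comparing one input record R against
// all n cached test cases, as map() would have to do with the array approach.
public class MultiTestCaseLoop {
    // Squared difference of two normalised attribute values
    static double sq(double a, double b) {
        double d = a - b;
        return d * d;
    }

    public static void main(String[] args) {
        // Hypothetical cached test cases (filled by setup() in the real mapper)
        double[] normalisedSAge    = {0.2, 0.8};
        double[] normalisedSIncome = {0.5, 0.1};
        int n = normalisedSAge.length;

        // One input record R, as map() would see it
        double rAge = 0.3, rIncome = 0.4;

        // Compare R against every cached test case, not just the first one
        for (int i = 0; i < n; i++) {
            double tDist = sq(rAge, normalisedSAge[i]) + sq(rIncome, normalisedSIncome[i]);
            System.out.printf("distance to test case %d: %.2f%n", i, tDist);
        }
    }
}
```

Each test case would then also need its own TreeMap of nearest neighbours, which is where the bookkeeping gets messy; running one job per test case, as suggested below under "Concrete problem", avoids all of this.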
Concrete problem
To prevent further trouble, I suggest that you leave the upper code of the GitHub class untouched and only modify the KnnPattern#main(String[]) method so that it runs the job n times, as described in this answer.
Edit: Example
This is a modified KnnPattern#main(String[]) method that reads your data file line by line, creates a temporary file with the current line as content, and starts a job with the temporary file as the cache file.
(Assuming that you are using at least Java 7)
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
...

public class KnnPattern
{
    ...

    public static void main(String[] args) throws Exception {
        // Create configuration
        Configuration conf = new Configuration();

        if (args.length != 3) {
            System.err.println("Usage: KnnPattern <in> <out> <parameter file>");
            System.exit(2);
        }

        try (final BufferedReader br = new BufferedReader(new FileReader(args[2]))) {
            int n = 1;
            String line;
            while ((line = br.readLine()) != null) {
                // create temporary file with content of current line
                final File tmpDataFile = File.createTempFile("hadoop-test-", null);
                try (BufferedWriter tmpDataWriter = new BufferedWriter(new FileWriter(tmpDataFile))) {
                    tmpDataWriter.write(line);
                    tmpDataWriter.flush();
                }

                // Create job
                Job job = Job.getInstance(conf, "Find K-Nearest Neighbour #" + n);
                job.setJarByClass(KnnPattern.class);
                // Set the third parameter when running the job to be the parameter file and give it an alias
                job.addCacheFile(new URI(tmpDataFile.getAbsolutePath() + "#knnParamFile")); // Parameter file containing test data

                // Setup MapReduce job
                job.setMapperClass(KnnMapper.class);
                job.setReducerClass(KnnReducer.class);
                job.setNumReduceTasks(1); // Only one reducer in this design

                // Specify key / value
                job.setMapOutputKeyClass(NullWritable.class);
                job.setMapOutputValueClass(DoubleString.class);
                job.setOutputKeyClass(NullWritable.class);
                job.setOutputValueClass(Text.class);

                // Input (the data file) and Output (the resulting classification)
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1] + "_" + n));

                // Execute job
                final boolean jobSucceeded = job.waitForCompletion(true);

                // clean up
                tmpDataFile.delete();

                if (!jobSucceeded) {
                    // return error status if job failed
                    System.exit(1);
                }

                ++n;
            }
        }
    }
}