Search code examples
javalinked-listtreemapknn

MapReduce-KNN for Hadoop - run multiple test cases from one data file


Background: [Skip ahead to next section for exact problem]

I am currently working on Hadoop as a small project in my University (not a mandatory project, I am doing it because I want to).

My plan was to use 5 PCs in one of the labs (Master + 4 Slaves) to run a KNN algorithm on a large data set to find out the running time, etc.

I knew I could find the basic code on the internet, and I did find it ( https://github.com/matt-hicks/MapReduce-KNN ). It runs fine for a single test case, but what I have is a very large one with hundreds of test cases. Therefore, I needed to iterate the same bit of code for each test case.

The Problem:

tl;dr: I have a KNN program that only takes one test case at a time, but I want to make it iterative so that it can work with multiple test cases.

My Solution:

I am not very experienced with this and from basics I know, I decided to make the variables and maps into arrays of variables and arrays of maps.

So this:

    public static class KnnMapper extends Mapper<Object, Text, NullWritable, DoubleString>
    {
        DoubleString distanceAndModel = new DoubleString();
        TreeMap<Double, String> KnnMap = new TreeMap<Double, String>();

        // Declaring some variables which will be used throughout the mapper
        int K;

        double normalisedSAge;
        double normalisedSIncome;
        String sStatus;
        String sGender;
double normalisedSChildren;

became this:

DoubleString distanceAndModel = new DoubleString();
    TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];



    // Declaring some variables which will be used throughout the mapper
    int[] K = new int[1000];

    double[] normalisedSAge = new double[1000];
    double[] normalisedSIncome = new double[1000];
    String[] sStatus = new String[1000];
    String[] sGender = new String[1000];
    double[] normalisedSChildren = new double[1000];
    int n = 0;

And this:

        protected void setup(Context context) throws IOException, InterruptedException
    {
        if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
        {
            // Read parameter file using alias established in main()
            String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));
            StringTokenizer st = new StringTokenizer(knnParams, ",");

            // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
            // These values will remain unchanged throughout the mapper
            K = Integer.parseInt(st.nextToken());
            normalisedSAge = normalisedDouble(st.nextToken(), minAge, maxAge);
            normalisedSIncome = normalisedDouble(st.nextToken(), minIncome, maxIncome);
            sStatus = st.nextToken();
            sGender = st.nextToken();
            normalisedSChildren = normalisedDouble(st.nextToken(), minChildren, maxChildren);
        }

}

became this:

protected void setup(Context context) throws IOException, InterruptedException
    {
        if (context.getCacheFiles() != null && context.getCacheFiles().length > 0)
        {
            // Read parameter file using alias established in main()
            String knnParams = FileUtils.readFileToString(new File("./knnParamFile"));
            //Splitting input File if we hit a newline character or return carriage i.e., Windown Return Key as input
            StringTokenizer lineSt = new StringTokenizer(knnParams, "\n\r");

            //Running a loop to tokennize each line of inputs or test cases
            while(lineSt.hasMoreTokens()){
            String nextLine = lineSt.nextToken();   //Converting current line to a string
            StringTokenizer st = new StringTokenizer(nextLine, ","); // Tokenizing the current string or singular data

            // Using the variables declared earlier, values are assigned to K and to the test dataset, S.
            // These values will remain unchanged throughout the mapper
            K[n] = Integer.parseInt(st.nextToken());
            normalisedSAge[n] = normalisedDouble(st.nextToken(), minAge, maxAge);
            normalisedSIncome[n] = normalisedDouble(st.nextToken(), minIncome, maxIncome);
            sStatus[n] = st.nextToken();
            sGender[n] = st.nextToken();
            normalisedSChildren[n] = normalisedDouble(st.nextToken(), minChildren, maxChildren);
            n++;
        }}
    }

And so on for the reducer class as well.

This is the first time I was working with TreeMaps though. I've studied and used trees before, but not Maps or TreeMaps. I still tried to make it and array which turned out to be wrong:

/home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation TreeMap[] KnnMap = new TreeMap[1000]; ^

/home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren); ^

/home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation TreeMap[] KnnMap = new TreeMap[1000]; ^

/home/hduser/Desktop/knn/KnnPattern.java:283: error: bad operand types for binary operator '>' if (KnnMap[num].size() > K) ^ first type: int second type: int[]

Now, I thought that maybe if I tried to use a Linked List of TreeMaps, it could work.

But, I have basically worked with C/C++ and Python in Uni so far. OOP here seems to make life easier for people but I am not a 100% sure how to use it.

My question:

Is it possible to make a Linked List of TreeMaps?

Is there a Linked List substitute for:

TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];

And is my approach to the problem correct? Making the code iterative should help iterate through all test cases, right?

I will, with try and error, try to make it work from there. But this is something I am kind of stuck at since a few days now.

My apologies if someone has already asked this before but I couldn't find anything and so I had to write a question. Please share the link of any related answer if you think this had already been answered before.

Thank you! And, on a side note: Anything else I should keep in mind when working with TreeMaps and specifically a linked list of TreeMaps.


Solution

  • Regarding the error messages

    /home/hduser/Desktop/knn/KnnPattern.java:81: error: generic array creation TreeMap[] KnnMap = new TreeMap[1000]; ^

    and

    /home/hduser/Desktop/knn/KnnPattern.java:238: error: generic array creation TreeMap[] KnnMap = new TreeMap[1000]; ^

    These errors occur because you tried to create an instance from a generic component type which is not supported by Java because the generic types are lost at runtime. A workaround (if you really need an array) would be to create a List of TreeMap objects and then to convert it to an array:

    // TreeMap<Double, String>[] KnnMap = new TreeMap<Double, String>[1000];
    List<TreeMap<Double, String>> KnnMapList = new LinkedList<>();
    TreeMap<Double, String>[] KnnMap = (TreeMap<Double, String>[]) KnnMapList.toArray();
    

    See this question for further information.


    /home/hduser/Desktop/knn/KnnPattern.java:198: error: incompatible types: double[] cannot be converted to double normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren); ^

    By looking at the source code at GitHub I realized that you probably did not modify the following method call in method KnnMapper#map(Object, Text, Context):

    double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
                        normalisedRChildren, normalisedSAge, normalisedSIncome, sStatus, sGender, normalisedSChildren);
    

    should be

    double tDist = totalSquaredDistance(normalisedRAge, normalisedRIncome, rStatus, rGender,
                        normalisedRChildren, normalisedSAge[n], normalisedSIncome[n], sStatus[n], sGender[n], normalisedSChildren[n]);
    

    But I guess that these modifications will not give you the desired function because KnnMapper#map(Object, Text, Context) is only called once per key/value pair as stated here and you probably want to call it n-times.


    Concrete problem

    To prevent from further trouble I suggest that you leave the upper code of the GitHub class untouched and only modify the KnnPattern#main(String[]) method in a way so that it calls the job n-times as described in this answer.


    Edit: Example

    This is a modified KnnPattern#main(String[]) method that reads your data file line by line, creates a temporary file with the current line as content and starts a job with the temporary file as cache file.
    (Assuming that you are using at least Java 7)

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    ...
    public class KnnPattern
    {
      ...
        public static void main(String[] args) throws Exception {
            // Create configuration
            Configuration conf = new Configuration();
    
            if (args.length != 3) {
                System.err.println("Usage: KnnPattern <in> <out> <parameter file>");
                System.exit(2);
            }
    
            try (final BufferedReader br = new BufferedReader(new FileReader(args[2]))) {
                int n = 1;
                String line;
                while ((line = br.readLine()) != null) {
                    // create temporary file with content of current line
                    final File tmpDataFile = File.createTempFile("hadoop-test-", null);
                    try (BufferedWriter tmpDataWriter = new BufferedWriter(new FileWriter(tmpDataFile))) {
                        tmpDataWriter.write(line);
                        tmpDataWriter.flush();
                    }
    
                    // Create job
                    Job job = Job.getInstance(conf, "Find K-Nearest Neighbour #" + n);
                    job.setJarByClass(KnnPattern.class);
                    // Set the third parameter when running the job to be the parameter file and give it an alias
                    job.addCacheFile(new URI(tmpDataFile.getAbsolutePath() + "#knnParamFile")); // Parameter file containing test data
    
                    // Setup MapReduce job
                    job.setMapperClass(KnnMapper.class);
                    job.setReducerClass(KnnReducer.class);
                    job.setNumReduceTasks(1); // Only one reducer in this design
    
                    // Specify key / value
                    job.setMapOutputKeyClass(NullWritable.class);
                    job.setMapOutputValueClass(DoubleString.class);
                    job.setOutputKeyClass(NullWritable.class);
                    job.setOutputValueClass(Text.class);
    
                    // Input (the data file) and Output (the resulting classification)
                    FileInputFormat.addInputPath(job, new Path(args[0]));
                    FileOutputFormat.setOutputPath(job, new Path(args[1] + "_" + n));
    
                    // Execute job
                    final boolean jobSucceeded = job.waitForCompletion(true);
    
                    // clean up
                    tmpDataFile.delete();
    
                    if (!jobSucceeded) {
                        // return error status if job failed
                        System.exit(1);
                    }
    
                    ++n;
                }
            }
        }
    
    }