Where can I find a practical example of KNN in Java using Weka?


I have been searching for a practical example of a KNN implementation using Weka, but everything I find is too general for me to understand what data it needs in order to work (or how to build the objects it needs) and what the results it shows mean. Maybe someone who has worked with it before has a better example with realistic things (products, movies, books, etc.) rather than the typical letters you see in algebra.

That way I can figure out how to implement it for my case (recommending dishes to the active user with KNN). Any help would be highly appreciated, thanks.

I was trying to understand it with this link, https://www.ibm.com/developerworks/library/os-weka3/index.html, but I don't understand how they got these results or how they got the formula:

[Image: knn]

Step 1: Determine Distance Formula

Distance = SQRT( ((58 - Age)/(69-35))^2 + ((51000 - Income)/(150000-38000))^2 )

Why is it always /(69-35) and /(150000-38000)?

EDIT:

Here is the code I have tried without success. I would appreciate it if someone could clear it up for me. I wrote this code by combining these two answers:

This answer shows how to get the knn:

How to get the nearest neighbor in weka using java

And this one tells me how to create instances (I don't really know what they are in Weka): Adding a new Instance in weka

So I came up with this:

public class Wekatest {

    public static void main(String[] args) {

        ArrayList<Attribute> atts = new ArrayList<>();
        ArrayList<String> classVal = new ArrayList<>();
        // I don't really understand what's happening here
        classVal.add("A");
        classVal.add("B");
        classVal.add("C");
        classVal.add("D");
        classVal.add("E");
        classVal.add("F");

        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));

        // In my case the data to evaluate are dishes ("plato" means dish in Spanish)
        Instances dataRaw = new Instances("TestInstancesPlatos", atts, 0);

        // I imagine that every instance is like an object that will be compared with the other instances to get its nearest neighbours (so an instance is like a dish for me).

        double[] instanceValue1 = new double[dataRaw.numAttributes()];

        instanceValue1[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue1[1] = 0;

        dataRaw.add(new DenseInstance(1.0, instanceValue1));

        double[] instanceValue2 = new double[dataRaw.numAttributes()];

        instanceValue2[0] = dataRaw.attribute(0).addStringValue("Tunas");
        instanceValue2[1] = 1;

        dataRaw.add(new DenseInstance(1.0, instanceValue2));

        double[] instanceValue3 = new double[dataRaw.numAttributes()];

        instanceValue3[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue3[1] = 2;

        dataRaw.add(new DenseInstance(1.0, instanceValue3));

        double[] instanceValue4 = new double[dataRaw.numAttributes()];

        instanceValue4[0] = dataRaw.attribute(0).addStringValue("Hamburguers");
        instanceValue4[1] = 3;

        dataRaw.add(new DenseInstance(1.0, instanceValue4));

        double[] instanceValue5 = new double[dataRaw.numAttributes()];

        instanceValue5[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue5[1] = 4;

        dataRaw.add(new DenseInstance(1.0, instanceValue5));

        System.out.println("---------------------");

        weka.core.neighboursearch.LinearNNSearch knn = new LinearNNSearch(dataRaw);
        try {

            // This method receives the goal instance whose neighbours you want to know, and N (I don't really know what N is, but I imagine it is the number of neighbours I want)
            Instances nearestInstances = knn.kNearestNeighbours(dataRaw.get(0), 1);
            // I expected the output to be the closest neighbour to dataRaw.get(0), which would be Pizzas, but instead I got some data that I don't really understand.


            System.out.println(nearestInstances);

        } catch (Exception e) {

            e.printStackTrace();
        }

    }

}

OUTPUT:

---------------------
@relation TestInstancesPlatos

@attribute content string
@attribute @@class@@ {A,B,C,D,E,F}

@data
Pizzas,A
Tunas,B
Pizzas,C
Hamburguers,D

weka dependency used:

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-stable</artifactId>
    <version>3.8.0</version>
</dependency>

Solution

  • KNN is a machine learning technique usually classified as an "instance-based predictor". It takes all instances of the classified samples and places them in an n-dimensional space.

    Using a distance measure such as Euclidean distance, KNN looks for the points closest to a new sample in this n-dimensional space and estimates which class the sample belongs to based on these neighbors. If it is closer to the blue dots, it is blue; if it is closer to the red dots, it is red.

    But now, how could we apply it to your problem?

    Imagine that you only have two attributes, price and calories (a 2-dimensional space). You want to classify customers into three classes: fit, junk food, gourmet. With this, you can offer a deal at a restaurant that matches the customer's preferences.

    You have the following data:

    +-------+----------+-----------+
    | Price | Calories | Food Type |
    +-------+----------+-----------+
    | $2    |    350   | Junk Food |
    +-------+----------+-----------+
    | $5    |    700   | Junk Food |
    +-------+----------+-----------+
    | $10   |    200   | Fit       |
    +-------+----------+-----------+
    | $3    |    400   | Junk Food |
    +-------+----------+-----------+
    | $8    |    150   | Fit       |
    +-------+----------+-----------+
    | $7    |    650   | Junk Food |
    +-------+----------+-----------+
    | $5    |    120   | Fit       |
    +-------+----------+-----------+
    | $25   |    230   | Gourmet   |
    +-------+----------+-----------+
    | $12   |    210   | Fit       |
    +-------+----------+-----------+
    | $40   |    475   | Gourmet   |
    +-------+----------+-----------+
    | $37   |    600   | Gourmet   |
    +-------+----------+-----------+
    

    Now, let's see it plotted in a 2D space:

    [Image: plot of the data in 2D]

    What happens next?

    For every new entry, the algorithm calculates the distance to all dots (instances) and finds the k nearest ones. The classes of these k nearest ones determine the class of the new entry.

    Take k = 3 and values $15 and 165 cal. Let's find the 3 nearest neighbors:

    [Image: the new point and its 3 nearest neighbors]

    This is where the distance formula comes in. The computation is done for every dot. The distances are then ranked, and the k closest dots decide the final class.
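    The ranking step can be sketched in plain Java, without Weka, using the table above. This is only a minimal illustration: the class name KnnSketch and its methods are hypothetical, the distance is the plain (not yet normalized) Euclidean distance, and ties in the vote simply go to the label that was seen first.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration; it does not use Weka.
class KnnSketch {

    // The table above as {price in dollars, calories} rows with their labels.
    static final double[][] FEATURES = {
        {2, 350}, {5, 700}, {10, 200}, {3, 400}, {8, 150}, {7, 650},
        {5, 120}, {25, 230}, {12, 210}, {40, 475}, {37, 600}
    };
    static final String[] LABELS = {
        "Junk Food", "Junk Food", "Fit", "Junk Food", "Fit", "Junk Food",
        "Fit", "Gourmet", "Fit", "Gourmet", "Gourmet"
    };

    // Plain Euclidean distance between two points (no normalization yet).
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Row indices of the k table entries closest to the query, nearest first.
    static int[] nearest(double[] query, int k) {
        Integer[] order = new Integer[FEATURES.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer row) -> distance(FEATURES[row], query)));
        int[] result = new int[Math.min(k, order.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    // Majority label among the k nearest rows; ties go to the label seen first.
    static String classify(double[] query, int k) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (int row : nearest(query, k)) {
            votes.merge(LABELS[row], 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> vote : votes.entrySet()) {
            if (vote.getValue() > bestCount) {
                best = vote.getKey();
                bestCount = vote.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] query = {15, 165}; // $15 and 165 cal, as in the text
        System.out.println(Arrays.toString(nearest(query, 3))); // prints [4, 2, 8]
        System.out.println(classify(query, 3));                 // prints Fit
    }
}
```

    The three nearest rows are ($8, 150), ($10, 200) and ($12, 210), all labelled Fit, so the new entry is classified as Fit.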

    Now, why the values /(69-35) and /(150000-38000)? As mentioned in other answers, this is due to normalization: each difference is divided by the range (maximum minus minimum) of its attribute, which in the IBM example appears to be 35 to 69 for age and 38000 to 150000 for income. Our example uses price and calories. As seen, calories span a much larger numeric range than price. Left unscaled, calories would dominate the distance and count for more than price when picking the class (which would kill the Gourmet class, for example). Normalization makes all attributes similarly important.
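    To see the effect on the table above, here is a small plain-Java sketch (no Weka) that divides each difference by the range of its column before computing the distance. The class name NormalizationSketch, its methods, and the query point ($30, 250 cal) are assumptions made up for this illustration; Weka's own distance classes may handle the details differently.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper class for illustration; it does not use Weka.
class NormalizationSketch {

    // The table above as {price in dollars, calories} rows with their labels.
    static final double[][] FEATURES = {
        {2, 350}, {5, 700}, {10, 200}, {3, 400}, {8, 150}, {7, 650},
        {5, 120}, {25, 230}, {12, 210}, {40, 475}, {37, 600}
    };
    static final String[] LABELS = {
        "Junk Food", "Junk Food", "Fit", "Junk Food", "Fit", "Junk Food",
        "Fit", "Gourmet", "Fit", "Gourmet", "Gourmet"
    };

    // Range (max - min) of every column, e.g. 40 - 2 = 38 for price.
    static double[] ranges() {
        int columns = FEATURES[0].length;
        double[] range = new double[columns];
        for (int c = 0; c < columns; c++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : FEATURES) {
                min = Math.min(min, row[c]);
                max = Math.max(max, row[c]);
            }
            range[c] = max - min;
        }
        return range;
    }

    // Euclidean distance where each difference is first divided by scale[i].
    // Pass the column ranges to normalize, or all ones for the raw distance.
    static double distance(double[] a, double[] b, double[] scale) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = (a[i] - b[i]) / scale[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Labels of the k rows closest to the query, nearest first.
    static String[] nearestLabels(double[] query, int k, double[] scale) {
        Integer[] order = new Integer[FEATURES.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer row) -> distance(FEATURES[row], query, scale)));
        String[] labels = new String[Math.min(k, order.length)];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = LABELS[order[i]];
        }
        return labels;
    }

    public static void main(String[] args) {
        double[] query = {30, 250};   // hypothetical new dish: $30, 250 cal
        double[] noScaling = {1, 1};  // dividing by 1 leaves the raw distance
        // prints [Gourmet, Fit, Fit]
        System.out.println(Arrays.toString(nearestLabels(query, 3, noScaling)));
        // prints [Gourmet, Gourmet, Fit]
        System.out.println(Arrays.toString(nearestLabels(query, 3, ranges())));
    }
}
```

    With raw distances the $30 dish gets two Fit neighbors out of three, because calories dominate. After dividing by the ranges (38 for price, 580 for calories), two of the three neighbors are Gourmet.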

    Weka abstracts that for you, but you can visualize it as well. See an example of visualization from a project I made for a Weka ML course:

    [Image: screenshot of the Weka visualization]

    Notice that, since there are many more than 2 dimensions, there are a lot of plots, but the idea is similar.

    Explaining the code:

    import java.util.ArrayList;

    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instances;
    import weka.core.neighboursearch.LinearNNSearch;

    public class Wekatest {

        public static void main(String[] args) {
            // atts holds the attribute (column) definitions of the dataset.
            // classVal holds the values that the nominal class attribute may take.
            ArrayList<Attribute> atts = new ArrayList<>();
            ArrayList<String> classVal = new ArrayList<>();
            // The "dictionary" of all distinct labels that can be targeted.
            classVal.add("A");
            classVal.add("B");
            classVal.add("C");
            classVal.add("D");
            classVal.add("E");
            classVal.add("F");
            // Attribute 0 is a free-text string attribute named "content".
            // Attribute 1 is a nominal attribute restricted to the labels in classVal.
            atts.add(new Attribute("content", (ArrayList<String>) null));
            atts.add(new Attribute("@@class@@", classVal));

            // Creates an empty in-memory dataset named "TestInstancesPlatos" with
            // initial capacity 0; nothing is read from a file. kNN does not train
            // anything: it keeps the labelled instances and searches them later.
            Instances dataRaw = new Instances("TestInstancesPlatos", atts, 0);

            // Each instance is one row, stored as one double per attribute.
            double[] instanceValue1 = new double[dataRaw.numAttributes()];

            // Slot 0 gets the index of the stored string "Pizzas"; slot 1 is an
            // index into classVal, so 0 means "A". It is a label, not a rating.
            instanceValue1[0] = dataRaw.attribute(0).addStringValue("Pizzas");
            instanceValue1[1] = 0;

            // Appends the row to the dataset (with weight 1.0).
            dataRaw.add(new DenseInstance(1.0, instanceValue1));

            double[] instanceValue2 = new double[dataRaw.numAttributes()];
            instanceValue2[0] = dataRaw.attribute(0).addStringValue("Tunas");
            instanceValue2[1] = 1;
            dataRaw.add(new DenseInstance(1.0, instanceValue2));

            double[] instanceValue3 = new double[dataRaw.numAttributes()];
            instanceValue3[0] = dataRaw.attribute(0).addStringValue("Pizzas");
            instanceValue3[1] = 2;
            dataRaw.add(new DenseInstance(1.0, instanceValue3));

            double[] instanceValue4 = new double[dataRaw.numAttributes()];
            instanceValue4[0] = dataRaw.attribute(0).addStringValue("Hamburguers");
            instanceValue4[1] = 3;
            dataRaw.add(new DenseInstance(1.0, instanceValue4));

            double[] instanceValue5 = new double[dataRaw.numAttributes()];
            instanceValue5[0] = dataRaw.attribute(0).addStringValue("Pizzas");
            instanceValue5[1] = 4;
            dataRaw.add(new DenseInstance(1.0, instanceValue5));

            // After adding 5 instances, time to test:
            System.out.println("---------------------");

            // Builds a brute-force nearest-neighbour search over the dataset.
            LinearNNSearch knn = new LinearNNSearch(dataRaw);
            try {
                // Asks for the 1 nearest neighbour of the first row. The result is an
                // Instances object holding the neighbouring rows, not a class label.
                // This dataset has no numeric attributes such as price or calories,
                // so the distances are not very meaningful.
                Instances nearestInstances = knn.kNearestNeighbours(dataRaw.get(0), 1);
                // Printing an Instances object shows its header and rows in ARFF format.
                System.out.println(nearestInstances);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    
    

    How should you do it?

    -> Gather data.
    -> Define a set of attributes that help you predict which cuisine you have (e.g. prices, dishes or ingredients, with one attribute for each dish or ingredient).
    -> Organize this data.
    -> Define a set of labels.
    -> Manually label a set of data.
    -> Load the labelled data into KNN.
    -> Label new instances by passing their attributes to KNN. It will return the label of the k nearest neighbors (good values for k are often 3 or 5, but you have to test).
    -> Have fun!
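    One simple way to test values of k is leave-one-out: hide each labelled row in turn, predict it from the remaining rows, and count the hits. The sketch below does this in plain Java (no Weka) on the price/calories table from earlier. The class name ChooseKSketch and its methods are hypothetical, and for brevity the ranges used for normalization are taken from the full table, which is a simplification.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration; it does not use Weka.
class ChooseKSketch {

    // The example table as {price in dollars, calories} rows with their labels.
    static final double[][] FEATURES = {
        {2, 350}, {5, 700}, {10, 200}, {3, 400}, {8, 150}, {7, 650},
        {5, 120}, {25, 230}, {12, 210}, {40, 475}, {37, 600}
    };
    static final String[] LABELS = {
        "Junk Food", "Junk Food", "Fit", "Junk Food", "Fit", "Junk Food",
        "Fit", "Gourmet", "Fit", "Gourmet", "Gourmet"
    };
    // max - min of each column in the table: 40 - 2 and 700 - 120.
    static final double[] RANGE = {38, 580};

    // Euclidean distance with each difference divided by its column range.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = (a[i] - b[i]) / RANGE[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Predicts the label of row `held` by majority vote among its k nearest
    // neighbours, using only the other rows. Ties go to the label seen first.
    static String predictHeldOut(int held, int k) {
        Integer[] others = new Integer[FEATURES.length - 1];
        int next = 0;
        for (int i = 0; i < FEATURES.length; i++) {
            if (i != held) {
                others[next++] = i;
            }
        }
        Arrays.sort(others, Comparator.comparingDouble(
                (Integer row) -> distance(FEATURES[row], FEATURES[held])));
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (int n = 0; n < Math.min(k, others.length); n++) {
            votes.merge(LABELS[others[n]], 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> vote : votes.entrySet()) {
            if (vote.getValue() > bestCount) {
                best = vote.getKey();
                bestCount = vote.getValue();
            }
        }
        return best;
    }

    // Number of rows whose held-out prediction matches their true label.
    static int correctCount(int k) {
        int correct = 0;
        for (int row = 0; row < FEATURES.length; row++) {
            if (predictHeldOut(row, k).equals(LABELS[row])) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        for (int k : new int[] {1, 3, 5}) {
            System.out.println("k=" + k + ": " + correctCount(k)
                    + " of " + FEATURES.length + " correct");
        }
    }
}
```

    On this tiny table, k = 1 gets 10 of 11 rows right and k = 3 gets 8 of 11, which shows that the best k depends on the data. With a real dataset you would want far more rows before trusting such a comparison.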