Clustering using the Weka API


I started clustering my data with open-source code built on Java and the Weka library. It runs correctly when the dataset is in .arff format, but I want to use the MovieLens dataset to cluster the users by their demographic information. The file is named "u.user"; you can find its description here: http://files.grouplens.org/datasets/movielens/ml-100k-README.txt

This is my code:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.io.IOException;
public class Clustering {
    public static void main(String args[]) throws Exception{
        //load dataset
        String dataset = "C:/Users/DELL/Desktop/work/u.user";
        DataSource source = new DataSource(dataset);
        //get instances object
        Instances data = source.getDataSet();
        // new instance of clusterer
        SimpleKMeans model = new SimpleKMeans(); // k-means clusterer
        //number of clusters
        model.setNumClusters(4);
        //set distance function
        //model.setDistanceFunction(new weka.core.ManhattanDistance());
        // build the clusterer
        model.buildClusterer(data);
        System.out.println(model);
    }
}

When I run it, this error is displayed:

Exception in thread "main" java.io.IOException: File not found : C:\Users\DELL\Desktop\work\u.names
    at weka.core.converters.C45Loader.setSource(C45Loader.java:190)
    at weka.core.converters.AbstractFileLoader.setFile(AbstractFileLoader.java:90)
    at weka.core.converters.ConverterUtils$DataSource.reset(ConverterUtils.java:306)
    at weka.core.converters.ConverterUtils$DataSource.<init>(ConverterUtils.java:141)
    at Clustering.main(Clustering.java:24)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Process finished with exit code 1

I am sure it is because of the file extension, because when I use another file with the .arff extension it works. Can you help me cluster my data?


Solution

  • You also need to pay attention to the file format, not only the extension. The stack trace shows that Weka chose C45Loader for u.user, and that loader went looking for a companion u.names file, hence the "File not found" error. Convert the dataset to Weka's ARFF format: for your u.user data, change the extension to .arff (e.g. user.arff) and the contents to something like:

    @RELATION user
    
    @ATTRIBUTE id   INTEGER  % this is actually useless
    @ATTRIBUTE age  INTEGER
    @ATTRIBUTE gender   {M,F}
    @ATTRIBUTE occupation   {administrator,artist,doctor,educator,engineer,entertainment,executive,healthcare,homemaker,lawyer,librarian,marketing,none,other,programmer,retired,salesman,scientist,student,technician,writer}  % from u.occupation
    @ATTRIBUTE zipcode  STRING
    
    @DATA
    1,24,M,technician,85711
    2,53,F,other,94043
    3,23,M,writer,32067
    4,24,M,technician,43537
    5,33,F,other,15213
    6,42,M,executive,98101
    7,57,M,administrator,91344
    8,36,M,administrator,05201
    ...
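
If you would rather not edit the file by hand, the conversion can be scripted. The following is only a minimal sketch using the plain JDK (no Weka calls). It assumes that each line of u.user holds five fields separated by `|` (check your copy of the file against the README), and the class name `UserToArff` and the command-line paths are made up for illustration. The header mirrors the ARFF example above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class UserToArff {
    // Occupation values copied from the ARFF header shown above (from u.occupation).
    static final String OCCUPATIONS =
        "administrator,artist,doctor,educator,engineer,entertainment,executive,"
        + "healthcare,homemaker,lawyer,librarian,marketing,none,other,programmer,"
        + "retired,salesman,scientist,student,technician,writer";

    /** Builds ARFF text from raw u.user lines. Assumption: fields are '|'-separated. */
    static String toArff(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION user\n\n");
        sb.append("@ATTRIBUTE id INTEGER\n");
        sb.append("@ATTRIBUTE age INTEGER\n");
        sb.append("@ATTRIBUTE gender {M,F}\n");
        sb.append("@ATTRIBUTE occupation {").append(OCCUPATIONS).append("}\n");
        sb.append("@ATTRIBUTE zipcode STRING\n\n");
        sb.append("@DATA\n");
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.trim().split("\\|");
            if (fields.length != 5) {
                throw new IllegalArgumentException("Expected 5 fields: " + line);
            }
            // ARFF data rows are comma-separated.
            sb.append(String.join(",", fields)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java UserToArff <u.user path> <output .arff path>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), toArff(lines).getBytes(StandardCharsets.UTF_8));
    }
}
```

The resulting .arff file can then be passed to `DataSource` exactly as in the question's code.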
    

    Weka should then be able to parse the dataset into a weka.core.Instances object. Unfortunately, SimpleKMeans will still reject your data with:

    weka.core.UnsupportedAttributeTypeException: weka.clusterers.SimpleKMeans: Cannot handle string attributes!

    So you are left with (at least) 3 options:

    1. Vectorize or convert the features of your data to numeric values (also remove useless data like id)
    2. Use another clustering algorithm that can handle categorical values such as weka.clusterers.HierarchicalClusterer
    3. Combine both approaches
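
To make option 1 a bit more concrete, here is a rough sketch of one possible vectorization in plain Java (again without Weka calls). It drops the id and zip code, keeps the age as a number, encodes gender as 0/1 and one-hot encodes the occupation. The class name `UserVectorizer` and the layout of the vector are my own assumptions for illustration, not something Weka prescribes.

```java
import java.util.Arrays;
import java.util.List;

public class UserVectorizer {
    // Occupation values in the same order as the ARFF header above.
    static final List<String> OCCUPATIONS = Arrays.asList(
        "administrator", "artist", "doctor", "educator", "engineer", "entertainment",
        "executive", "healthcare", "homemaker", "lawyer", "librarian", "marketing",
        "none", "other", "programmer", "retired", "salesman", "scientist",
        "student", "technician", "writer");

    /**
     * Converts one comma-separated record (id,age,gender,occupation,zipcode)
     * into a numeric vector: [age, isMale, one-hot occupation...].
     * The id and zip code are dropped.
     */
    static double[] vectorize(String csvLine) {
        String[] fields = csvLine.trim().split(",");
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields: " + csvLine);
        }
        double[] vector = new double[2 + OCCUPATIONS.size()];
        vector[0] = Double.parseDouble(fields[1]);       // age
        vector[1] = fields[2].equals("M") ? 1.0 : 0.0;   // gender as 0/1
        int index = OCCUPATIONS.indexOf(fields[3]);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown occupation: " + fields[3]);
        }
        vector[2 + index] = 1.0;                          // one-hot occupation
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(vectorize("1,24,M,technician,85711")));
    }
}
```

Each vector could then be written out as a row of numeric ARFF attributes, so the clusterer no longer sees any string attribute. Note that age is on a much larger scale than the 0/1 columns, so you may want to rescale it before clustering.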

    Good luck!