Tags: machine-learning, weka

How to predict several unlabelled attributes at once using WEKA and NaiveBayes?


I have a binary array of 96 elements; it could look something like this:

[false, true, true, false, true, true, false, false, false, true.....]

Each element represents a 15-minute time interval, starting from 00:00: the first element is 00:15, the second 00:30, the third 00:45, and so on. The boolean tells whether the house was occupied in that time interval.

I want to train a classifier so that it can predict the rest of a day when only some part of the day is known. Let's say I have observations for the past 100 days, and I only know the first 20 elements of the current day.

How can I use classification to predict the rest of the day?

I tried creating an ARFF file that looks like this:

@RELATION OccupancyDetection

@ATTRIBUTE Slot1 {true, false}
@ATTRIBUTE Slot2 {true, false}
@ATTRIBUTE Slot3 {true, false}
...
@ATTRIBUTE Slot96 {true, false}

@DATA
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,true,true,false,true,true,true,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
.....

I then ran a Naive Bayes classification on it. The problem is that the results only show the performance for a single attribute (the last one, for instance).

A "real" sample taken on a given day might look like this:

true,true,true,true,true,true,true,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?

How can I predict all the unlabelled attributes at once?

I made this based on the WekaManual-3-7-11, and it works, but only for a single attribute:

    ..
    Instances unlabeled = DataSource.read("testWEKA1.arff");
    // the last attribute (Slot96) is treated as the class to predict
    unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
    // create copy
    Instances labeled = new Instances(unlabeled);
    // label instances
    for (int i = 0; i < unlabeled.numInstances(); i++) {
        double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
        labeled.instance(i).setClassValue(clsLabel);
    }
    DataSink.write("labeled.arff", labeled);

Solution

  • Sorry, but I don't believe that you can predict multiple attributes using Naive Bayes in Weka.

    What you could do as an alternative, if running Weka through Java code, is loop over all of the attributes that need to be filled. This could be done by building one classifier per missing slot (treating that slot as the class attribute) and filling in the next blank until all of the missing data is entered.
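    A minimal sketch of that idea, assuming the 100 historical days are in occupancy.arff, the partially known day is the single instance in testWEKA1.arff, and the first 20 slots are the known ones (the file names, class name and slot boundary are placeholders, not anything fixed by Weka):

        import weka.classifiers.bayes.NaiveBayes;
        import weka.core.Instance;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSink;
        import weka.core.converters.ConverterUtils.DataSource;

        public class FillRemainingSlots {
            public static void main(String[] args) throws Exception {
                Instances train = DataSource.read("occupancy.arff");   // 100 historical days
                Instances today = DataSource.read("testWEKA1.arff");   // partially known day
                Instance current = today.instance(0);

                // Slots 1-20 (indices 0-19) are known; fill the rest left to right.
                for (int slot = 20; slot < train.numAttributes(); slot++) {
                    train.setClassIndex(slot);
                    today.setClassIndex(slot);

                    NaiveBayes nb = new NaiveBayes();
                    nb.buildClassifier(train);

                    // Predict this slot and write it back, so the next
                    // iteration can condition on the filled-in value.
                    double clsLabel = nb.classifyInstance(current);
                    current.setClassValue(clsLabel);
                }
                DataSink.write("labeled.arff", today);
            }
        }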

    It also appears that what you have is time-based. Perhaps if the model were somewhat restructured, it could all fit within a single model. For example, you could have attributes for prediction time, day of week and presence over the last few hours, as well as attributes that describe historical presence in the house. It might be going over the top for your problem, but it could also eliminate the need for multiple classifiers.

    Hope this helps!

    Update!

    As per your request, I have taken a couple of minutes to think about the problem at hand. The thing about this time-based prediction is that you want to be able to predict the rest of the day, but the amount of data available to your classifier varies with the time of day. Given the current structure, this would mean you need a classifier for each 15-minute time slot, where classifiers for earlier slots have far less input data than those for later slots.

    If it is possible, you could instead use a different approach in which an equal amount of historical information is used for each time slot, so that the same classifier can possibly be shared across all cases. One possible set of information is outlined below (a rough ARFF sketch of such a relation follows the list):

    • The time slot to be estimated
    • The day of week
    • The previous hour or two of activity
    • Other activity over the previous 24 hours
    • Historical information about that time slot in general
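    For example, a restructured relation could be declared along these lines (the attribute names and encodings are only one possible guess, not a prescription):

        @RELATION OccupancyForecast

        @ATTRIBUTE SlotOfDay        NUMERIC                        % 1-96, the slot being estimated
        @ATTRIBUTE DayOfWeek        {mon,tue,wed,thu,fri,sat,sun}
        @ATTRIBUTE LastHourOccupied NUMERIC                        % occupied slots in the previous hour (0-4)
        @ATTRIBUTE Last24hOccupied  NUMERIC                        % occupied slots in the previous 24 hours (0-96)
        @ATTRIBUTE HistoricalRate   NUMERIC                        % long-run occupancy rate for this slot
        @ATTRIBUTE Occupied         {true,false}                   % the value to predict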

    If you obtain your information on a daily basis, it may be possible to quantify each of these factors and then use them to predict any time slot. Then, if you wanted predictions for a whole day, you could keep feeding in the previous predictions until the day is complete.
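    In code, that rolling forecast could look roughly like the sketch below. It assumes the restructured relation above, a model already trained on it, and a deliberately simplified feature encoding; the class name, helper and encodings are illustrative assumptions, only the imported Weka classes are real API.

        import weka.classifiers.Classifier;
        import weka.core.DenseInstance;
        import weka.core.Instance;
        import weka.core.Instances;

        public class RollingForecast {

            /** Assembles one feature vector for the given slot from the history so far. */
            static Instance featuresFor(Instances header, int slot, int dayOfWeek, boolean[] day) {
                double lastHour = 0, last24h = 0;
                for (int s = Math.max(0, slot - 4); s < slot; s++) { if (day[s]) lastHour++; }
                for (int s = 0; s < slot; s++)                     { if (day[s]) last24h++; }

                Instance inst = new DenseInstance(header.numAttributes()); // values start out missing
                inst.setDataset(header);
                inst.setValue(0, slot + 1);   // SlotOfDay
                inst.setValue(1, dayOfWeek);  // DayOfWeek as a nominal index
                inst.setValue(2, lastHour);   // LastHourOccupied
                inst.setValue(3, last24h);    // Last24hOccupied
                return inst;                  // HistoricalRate left missing for brevity
            }

            /** Fills day[knownSlots..95] by feeding each prediction back into the history. */
            static void predictRestOfDay(Classifier model, Instances header,
                                         int dayOfWeek, boolean[] day, int knownSlots) throws Exception {
                header.setClassIndex(header.numAttributes() - 1);
                for (int slot = knownSlots; slot < 96; slot++) {
                    double cls = model.classifyInstance(featuresFor(header, slot, dayOfWeek, day));
                    day[slot] = (cls == 0.0); // index 0 is "true" in the {true,false} class
                }
            }
        }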

    I have worked on a similar problem, predicting time of arrival based on similar factors (previous behaviour, public holidays, day of week, etc.), and the estimates were usually reasonable, though only as accurate as you could expect for a human process.