weka

How can I select Yes/No question IDs dynamically in a Weka J48 app?


I'm developing a Weka app like Akinator using the J48 classifier.

Sample: http://jbossews-vdoctor.rhcloud.com/doctor

The following is the app's table definition and sample data. The qa column is the question ID (please refer to the master list, which can be set by the user) followed by the answer (1: Yes, 2: I don't know, 3: No). There is one line per question-and-answer pair.


id,qa,class

A,13,1

A,23,1

B,13,2

B,21,2


The point is to find a way to select the question that maximizes the entropy. Currently, the app treats the attribute at the root node of the decision tree as the best question, and then narrows down the options by elimination.
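To make that selection criterion concrete, here is a minimal Python sketch (not the app's actual code) that scores each remaining question by information gain, i.e. how much the class entropy drops once the answer is known. J48 implements C4.5, which uses the related gain ratio; plain information gain is swapped in here for brevity. The `cases` layout, the function names, and treating an unanswered question as answer 2 ("I don't know") are assumptions made for illustration. The two cases are the sample rows A and B, with the last digit of qa read as the answer.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(cases, question):
    """Entropy of the classes minus the weighted entropy after splitting on one question."""
    groups = {}
    for case in cases:
        # Assumption: a question with no recorded answer counts as "2" (I don't know).
        answer = case["answers"].get(question, "2")
        groups.setdefault(answer, []).append(case["class"])
    remainder = sum(len(g) / len(cases) * entropy(g) for g in groups.values())
    return entropy([case["class"] for case in cases]) - remainder


def best_question(cases, asked=()):
    """Return the not-yet-asked question with the highest information gain."""
    candidates = {q for case in cases for q in case["answers"]} - set(asked)
    return max(sorted(candidates), key=lambda q: information_gain(cases, q))


# Sample rows A and B, rewritten as one record per id.
cases = [
    {"class": "1", "answers": {"1": "3", "2": "3"}},
    {"class": "2", "answers": {"1": "3", "2": "1"}},
]
print(best_question(cases))
```

With these two cases, question 1 has a gain of 0 because both ids answered it the same way, so question 2 is the one selected.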

However, the accuracy is too poor for the app to work correctly, so I'd like to improve it. I noticed that the qa column is identified as numeric, so Weka cannot build a correct decision tree from it.

I am not sure what I should change to improve it. The dataset? The table definition? The logic?


Solution

  • This is quite a broad question, and without code or a clearer description of the problem it is difficult to answer, but here are some tips for improvement:

    Table Definition

    It would probably make more sense to have one attribute per question instead of one instance per question. For example, instead of id, qa and class, you could have A, B, C, D, E, F and Disease. (I believe there were six questions; giving each attribute a descriptive name rather than A-F is recommended.)
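As a rough illustration of that layout, the Python sketch below pivots the long id,qa,class rows into one row per id and writes them as ARFF text with every question declared as a nominal attribute, so Weka would not read the answers as numeric. It assumes the last digit of qa is the answer and the leading digits are the question ID. The attribute names (`q1`, `q2`), the relation name, and the function names are made up for the example. Unanswered questions are written as `?`, the ARFF marker for a missing value.

```python
import csv
import io

ANSWERS = ("1", "2", "3")  # 1: Yes, 2: I don't know, 3: No


def pivot_long_rows(rows):
    """Turn (id, qa, class) rows into one record per id."""
    records = {}
    for case_id, qa, label in rows:
        # Assumed encoding: last digit is the answer, the rest is the question ID.
        question, answer = qa[:-1], qa[-1]
        record = records.setdefault(case_id, {"class": label, "answers": {}})
        record["answers"][question] = answer
    return records


def to_arff(records, relation="diagnosis"):
    """Render the records as ARFF text with nominal question attributes."""
    questions = sorted(
        {q for r in records.values() for q in r["answers"]},
        key=lambda q: (len(q), q),  # orders numeric-looking IDs as 1, 2, ..., 10
    )
    classes = sorted({r["class"] for r in records.values()})
    lines = ["@relation " + relation, ""]
    for q in questions:
        lines.append("@attribute q" + q + " {" + ",".join(ANSWERS) + "}")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for case_id in sorted(records):
        record = records[case_id]
        cells = [record["answers"].get(q, "?") for q in questions]  # "?" = missing
        lines.append(",".join(cells + [record["class"]]))
    return "\n".join(lines) + "\n"


SAMPLE = "A,13,1\nA,23,1\nB,13,2\nB,21,2\n"
rows = list(csv.reader(io.StringIO(SAMPLE)))
arff_text = to_arff(pivot_long_rows(rows))
print(arff_text)
```

For the four sample rows this gives two data rows, one for A and one for B, each holding the answers to questions 1 and 2 followed by the class.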

    Dataset

    You will need at least as many cases as there are diseases, and preferably more, so that several subsets of the problem space are covered for the same disease. There will likely be cases where some questions are irrelevant or unanswered, and the model needs to handle those situations.

    Logic

    With that layout, you can run the questionnaire by starting at the root node of the tree, asking the question at that node, and following the branch for the given answer. Repeat from node to node until you reach a leaf, which gives the estimated class.
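The node-to-node walk could look roughly like the Python sketch below. The tree is a hand-written nested dictionary, not something exported from J48, and the keys `question`, `branches` and `default`, as well as the `unknown` fallback, are hypothetical names chosen for the example.

```python
def ask_until_leaf(node, answer_for):
    """Walk from the root node to a leaf; return (predicted class, questions asked)."""
    asked = []
    while isinstance(node, dict):  # inner nodes are dicts, leaves are class labels
        question = node["question"]
        asked.append(question)
        answer = answer_for(question)
        # Fall back to the node's default when the answer has no branch.
        node = node["branches"].get(answer, node["default"])
    return node, asked


# Hypothetical tree: ask q2 first, then q1 only if the answer to q2 was 3 (No).
TREE = {
    "question": "q2",
    "branches": {
        "1": "2",
        "3": {"question": "q1", "branches": {"3": "1"}, "default": "unknown"},
    },
    "default": "unknown",
}

answers = {"q2": "3", "q1": "3"}  # stands in for the user's replies
predicted, asked = ask_until_leaf(TREE, answers.get)
print(predicted, asked)
```

In a real app, `answer_for` would prompt the user instead of reading from a dictionary, and the tree would come from the trained model.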

    I hope this helps in improving your existing model.

    NOTE: I tried your questionnaire, answered No to all of the questions, and strangely ended up with Trichomoniasis. Perhaps your training data should also include a 'No Disease' category.