I have a data set and I am doing classification with Weka's NaiveBayes classifier. I have 14 attributes, some of which are nominal.
Only one of these attributes has missing values. So far I have left them as missing values, and as far as I understand Weka replaces those values automatically (there is a related question about that here).
I mean, the values for this attribute are empty in my feature file, and when I create the ARFF file, I see "?" between the two commas.
Now, I have two possibilities: 1) Let Weka fill them in automatically. 2) Replace them with "NULL".
The problem is that the classifier performs better in the first case. So I am wondering whether it is acceptable to let Weka replace them, or whether I should use the second approach even though the results are worse.
In other words, when should we let Weka replace the missing values, and when not?
For context, the feature with missing values represents the WordNet supersense of a word; when it is empty, the instance is, for example, a preposition or a WH question.
Thanks in advance,
Well, about missing values: Weka does not replace them by default; you have to apply a filter (exactly as in the post you linked in your question). Some classifiers can handle missing values on their own. I think Naive Bayes can, simply by not counting them in the probability calculation.

So basically you have three options:
1) Apply the ReplaceMissingValues filter, which replaces missing values with the mode (for nominal attributes) or the mean (for numeric attributes).
2) Apply no filter and use the data set with the missing values left in. In this case I recommend having a look at how Naive Bayes works, so that you understand how your missing values will be treated and whether that is good for you.
3) Replace the missing values with a label of your own, such as "other".

The key to the right choice is probably in your last paragraph, which suggests that your missing values actually mean something. If so, I would use the third approach, your own label. If, on the other hand, the missing values mean nothing and are just the result of some fault in data collection, I would consider the first two approaches. Good luck.
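To make the difference between the three options concrete, here is a minimal Python sketch on a toy nominal attribute. It is not Weka's code: the data, the helper names, the "none" label and the use of Laplace smoothing are all assumptions made for illustration. It only shows how each option changes the per-class counts that a Naive Bayes style estimate of P(value | class) is built from.

```python
from collections import Counter, defaultdict

MISSING = None  # stands in for "?" in an ARFF file

# Hypothetical toy data: (supersense, class label)
rows = [
    ("noun.person", "A"),
    ("noun.person", "A"),
    ("verb.motion", "B"),
    (MISSING, "B"),
    (MISSING, "B"),
    ("noun.person", "B"),
]

def replace_with_mode(rows):
    """Option 1: fill missing values with the most frequent observed value."""
    mode = Counter(v for v, _ in rows if v is not MISSING).most_common(1)[0][0]
    return [(mode if v is MISSING else v, c) for v, c in rows]

def replace_with_label(rows, label="none"):
    """Option 3: treat 'missing' as a value of its own."""
    return [(label if v is MISSING else v, c) for v, c in rows]

def conditional_counts(rows):
    """Count values per class; missing values are skipped (option 2)."""
    counts = defaultdict(Counter)
    for v, c in rows:
        if v is not MISSING:
            counts[c][v] += 1
    return counts

def likelihood(counts, value, cls, n_values):
    """Laplace-smoothed estimate of P(value | cls) (smoothing is an assumption)."""
    total = sum(counts[cls].values())
    return (counts[cls][value] + 1) / (total + n_values)

# Option 2: leave the values missing; they do not contribute to any count.
p_skip = likelihood(conditional_counts(rows), "noun.person", "B", n_values=2)

# Option 1: mode imputation inflates the count of the most frequent value.
p_mode = likelihood(conditional_counts(replace_with_mode(rows)),
                    "noun.person", "B", n_values=2)

# Option 3: the new label becomes a third value with its own per-class counts.
labelled = conditional_counts(replace_with_label(rows))
p_label_b = likelihood(labelled, "none", "B", n_values=3)
p_label_a = likelihood(labelled, "none", "A", n_values=3)

print(p_skip, p_mode, p_label_b, p_label_a)
```

In this toy data the missing values occur only in class B, so with option 3 the estimate for "none" is higher for B than for A, and the label itself carries information about the class. With options 1 and 2 that information is either blended into another value or dropped.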