Search code examples
data-miningdecision-treebayesian-networks

Decision tree vs. Naive Bayes classifier


I am doing some research about different data mining techniques and came across something that I could not figure out. If any one have any idea that would be great.

In which cases is it better to use a Decision tree and other cases a Naive Bayes classifier?

Why use one of them in certain cases? And the other in different cases? (By looking at its functionality, not at the algorithm)

Anyone have some explanations or references about this?


Solution

  • Decision Trees are very flexible, easy to understand, and easy to debug. They will work with classification problems and regression problems. So if you are trying to predict a categorical value like (red, green, up, down) or if you are trying to predict a continuous value like 2.9, 3.4 etc Decision Trees will handle both problems. Probably one of the coolest things about Decision Trees is they only need a table of data and they will build a classifier directly from that data without needing any up front design work to take place. To some degree properties that don't matter won't be chosen as splits and will get eventually pruned so it's very tolerant of nonsense. To start it's set it and forget it.

    However, the downside. Simple decision trees tend to over fit the training data more so that other techniques which means you generally have to do tree pruning and tune the pruning procedures. You didn't have any upfront design cost, but you'll pay that back on tuning the trees performance. Also simple decision trees divide the data into squares so building clusters around things means it has to split a lot to encompass clusters of data. Splitting a lot leads to complex trees and raises probability you are overfitting. Tall trees get pruned back so while you can build a cluster around some feature in the data it might not survive the pruning process. There are other techniques like surrogate splits which let you split along several variables at once creating splits in the space that aren't either horizontal or perpendicular ( 0 < slope < infinity ). Cool, but your tree starts to become harder to understand, and its complex to implement these algorithms. Other techniques like boosting and random forest decision trees can perform quite well, and some feel these techniques are essential to get the best performance out of decision trees. Again this adds more things to understand and use to tune the tree and hence more things to implement. In the end the more we add to the algorithm the taller the barrier to using it.

    Naive Bayes requires you build a classification by hand. There's not way to just toss a bunch of tabular data at it and have it pick the best features it will use to classify. Picking which features matter is up to you. Decisions trees will pick the best features for you from tabular data. If there were a way for Naive Bayes to pick features you'd be getting close to using the same techniques that make decision trees work like that. Give this fact that means you may need to combine Naive Bayes with other statistical techniques to help guide you towards what features best classify and that could be using decision trees. Naive bayes will answer as a continuous classifier. There are techniques to adapt it to categorical prediction however they will answer in terms of probabilities like (A 90%, B 5%, C 2.5% D 2.5%) Bayes can perform quite well, and it doesn't over fit nearly as much so there is no need to prune or process the network. That makes them simpler algorithms to implement. However, they are harder to debug and understand because it's all probabilities getting multiplied 1000's of times so you have to be careful to test it's doing what you expect. Naive bayes does quite well when the training data doesn't contain all possibilities so it can be very good with low amounts of data. Decision trees work better with lots of data compared to Naive Bayes.

    Naive Bayes is used a lot in robotics and computer vision, and does quite well with those tasks. Decision trees perform very poorly in those situations. Teaching a decision tree to recognize poker hands by looking a millions of poker hands does very poorly because royal flushes and quads occurs so little it often gets pruned out. If it's pruned out of the resulting tree it will misclassify those important hands (recall tall trees discussion from above). Now just think if you are trying to diagnose cancer using this. Cancer doesn't occur in the population in large amounts, and it will get pruned out more likely. Good news is this can be handled by using weights so we weight a winning hand or having cancer as higher than a losing hand or not having cancer and that boosts it up the tree so it won't get pruned out. Again this is the part of tuning the resulting tree to the situation that I discussed earlier.

    Decision trees are neat because they tell you what inputs are the best predicators of the outputs so often decision trees can guide you to find if there is a statistical relationship between a given input to the output and how strong that relationship is. Often the resulting decision tree is less important than relationships it describes. So decision trees can be used a research tool as you learn about your data so you can build other classifiers.

    If you are dicing between using decision trees vs naive bayes to solve a problem often times it best to test each one. Build a decision tree and build a naive bayes classifier then have a shoot out using the training and validation data you have. Which ever performs best will more likely perform better in the field. And it's always a good idea to cast each of those against K-nearest neighbor (KNN) predictors because k-nearest has been shown to out perform both of them in some situations, and KNN is a simple algorithm to implement and use. If KNN performs better than the other two go with it.

    Some sources:

    The manual on CART based decision trees. This books covers the CART algorithm, but also discusses decision trees, weights, missing values, surrogate splits, boosting, etc. http://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418

    A gentler intro to CART https://www.youtube.com/watch?v=p17C9q2M00Q

    Comparison of algorithms - notice that KNN, Decision Trees, C4.5, and SVM do quite well on most of the tests. http://www4.ncsu.edu/~arezaei2/paper/JCIT4-184028_Camera%20Ready.pdf

    Another comparison of algorithms - Boosted Decision Trees and random top the list with KNN in the middle: http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf

    Another good run down of various techniques: http://www.quora.com/What-are-the-advantages-of-different-classification-algorithms