Tags: r, decision-tree, rpart

How do duplicated rows affect a decision tree?


I am using rpart to build a decision tree for a categorical variable, and I am wondering whether I should use the full data set or just the set of unique rows.


Solution

  • I am answering this as a general question on decision trees, rather than on the R implementation.

    The parameters for decision trees are often based on record counts: minimum leaf size and the minimum node size at which a split is searched for come to mind (in rpart, minbucket and minsplit). In addition, purity measures are computed from the class counts in each node, so they shift when some rows are counted more than once as the tree is being built. When you have duplicated records, you are implicitly putting a weight on the values in those rows.
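    The equivalence between duplicates and weights can be checked by hand. The sketch below is plain Python rather than rpart, to keep the point independent of any one implementation. The labels, the weights, and the MIN_NODE_SIZE threshold are all made up for illustration.

```python
from collections import Counter


def gini(labels, weights=None):
    """Gini impurity of a node; each label counts `weight` times (default 1)."""
    if weights is None:
        weights = [1] * len(labels)
    totals = Counter()
    for label, weight in zip(labels, weights):
        totals[label] += weight
    n = sum(totals.values())
    return 1.0 - sum((count / n) ** 2 for count in totals.values())


# Hypothetical node with three distinct records (labels only).
unique_labels = ["yes", "no", "no"]

# The same node when the "yes" record appears three times in the data.
duplicated_labels = ["yes", "yes", "yes", "no", "no"]

impurity_unique = gini(unique_labels)
impurity_duplicated = gini(duplicated_labels)
# Unique rows with an explicit weight of 3 on the "yes" record.
impurity_weighted = gini(unique_labels, weights=[3, 1, 1])

print("unique rows:    ", impurity_unique)
print("with duplicates:", impurity_duplicated)
print("with weights:   ", impurity_weighted)

# A count-based stopping rule (hypothetical threshold) also changes its verdict.
MIN_NODE_SIZE = 5
can_split_unique = len(unique_labels) >= MIN_NODE_SIZE
can_split_duplicated = len(duplicated_labels) >= MIN_NODE_SIZE
print("split attempted on unique rows:   ", can_split_unique)
print("split attempted with duplicates:  ", can_split_duplicated)
```

    The duplicated and weighted versions give the same impurity, which differs from the impurity of the unique rows. The node also crosses the size threshold only when the duplicates are counted.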

    This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated rows arise from different runs of an experiment, they are genuine repeated observations and keeping them should be fine.

    In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of copies of a single instance from the original data, and overfitting would be a problem.
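    The oversampling problem can be made concrete with a small, made-up data set. In the sketch below, the feature values, the class labels, the MIN_LEAF_SIZE threshold, and the leaf boundaries are all assumptions chosen for illustration. The leaf is a region that a tree could isolate with two threshold splits, not the output of an actual fitted tree.

```python
# Hypothetical imbalanced data: (feature, label) pairs with one numeric feature.
majority = [(x, "neg") for x in range(0, 40, 2)]  # 20 distinct rows: 0, 2, ..., 38
minority = [(5, "pos"), (15, "pos")]              # 2 distinct rows

# Naive oversampling: repeat each minority row 10 times to balance the classes.
training_rows = majority + minority * 10

# A region reachable by two threshold splits: 14 < feature < 16.
leaf = [row for row in training_rows if 14 < row[0] < 16]

MIN_LEAF_SIZE = 5  # hypothetical minimum leaf size
leaf_size = len(leaf)
distinct_in_leaf = len(set(leaf))
passes_size_check = leaf_size >= MIN_LEAF_SIZE

print("rows in training set:    ", len(training_rows))
print("rows in leaf:            ", leaf_size)
print("distinct records in leaf:", distinct_in_leaf)
print("passes minimum leaf size:", passes_size_check)
```

    The leaf satisfies the minimum leaf size, yet every row in it is a copy of the same original record, so the size check no longer protects against fitting a rule to a single observation.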