The following question confuses me a lot. Could you help me with it (preferably by pointing me to an academic reference)?
We normally use the base-2 logarithm to calculate entropy in decision trees. Is this because most nodes only allow binary branches?
If I want a node with more than two branches, is log2 still theoretically valid?
For example, in XGBoost the training set must be supplied as a matrix, which I think means we can only use numerical values as input.
Thank you very much!
Base 2 for the logarithm is almost certainly used because we like to measure entropy in bits. This is just a convention; some people use base e instead (nats instead of bits).
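To spell that out with the standard definition: the Shannon entropy of a discrete distribution with probabilities $p_i$, computed in base $b$, is

$$H_b(X) = -\sum_i p_i \log_b p_i,$$

and since $\log_2 x = \ln x / \ln 2$, the entropy in bits is just the entropy in nats divided by the constant $\ln 2$. Changing the base rescales every entropy (and hence every information gain) by the same factor, so it can never change which split a tree-building algorithm prefers.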
I cannot speak to XGBoost, but for discrete decision problems entropy comes into play as a performance measure, not as a direct consequence of the tree structure. You can calculate the information gain of any split (with any branching factor) straight from the definition of entropy.
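As a concrete illustration, here is a minimal sketch in plain Python (the labels and the three-way split are made-up data, purely for demonstration) that computes the information gain of a split directly from the definition; note that nothing in it assumes binary branching:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent node minus the size-weighted entropy
    of the child nodes. Works for any number of children, i.e.
    any branching factor."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Hypothetical example: a 3-way split of 9 samples with labels A/B.
parent = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
children = [["A", "A", "A"], ["A", "B", "B"], ["B", "B", "B"]]
print(information_gain(parent, children))  # ~0.685 bits of gain
```

Swapping `log2` for the natural log would scale every gain by the same constant and leave the ranking of candidate splits unchanged.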
If you're looking for a book on information theory and probability, I can highly recommend MacKay's Information Theory, Inference, and Learning Algorithms (the full PDF is freely available). He covers quite a bit of machine learning and statistics, although decision trees are not covered.