What is the meaning of alpha in the context of an information gain pruning function?


In the PST package we use the value C as a cut-off for the information gain function used to prune the tree. The C value for an alpha of 0.05 is calculated as follows:

C95 <- qchisq(0.95, 1) / 2

What does it mean that the C value is based on an alpha of 0.05? Does it mean we need to be at least 95% certain that an additional node adds more information compared to previous nodes, in order for it to be retained by the pruning algorithm?


Solution

  • Your question concerns the use of gain="G2" in the prune function and is about the choice of the threshold C for this gain function.

    Twice the G2 gain used to check whether a branch can be pruned is actually the likelihood ratio test statistic that compares the likelihood of the tree before and after pruning the branch. Under the null hypothesis that the tested branch does not add any information, the statistic 2*G2 has a chi-squared distribution. So the branch is pruned when the difference in likelihood is not statistically significant, i.e. as long as the G2 value does not exceed the threshold C for the given significance level. The division by 2 in the formula for C converts the chi-squared quantile for 2*G2 into a threshold for G2 itself.

    The alpha is the usual significance level used in statistical tests. It is typically 1% or 5%. Choosing alpha = 0.05 means that there is a 5% chance of wrongly NOT pruning a branch, i.e. of keeping a branch that adds no information, because of the randomness of the sample.
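
As a rough illustration of the decision rule described above, the sketch below computes the cut-off for two significance levels and shows that comparing G2 with C is the same as comparing the p-value of the likelihood ratio test with alpha. The helper `cutoff()` and the value of `G2` are made up for this example and are not part of the PST package; the single degree of freedom is simply carried over from the `C95` line in the question.

```r
# Cut-off C for a given significance level alpha.
# df = 1 is an assumption taken from the question's C95 formula.
cutoff <- function(alpha, df = 1) qchisq(1 - alpha, df) / 2

C95 <- cutoff(0.05)  # same as qchisq(0.95, 1) / 2, about 1.92
C99 <- cutoff(0.01)  # stricter level, about 3.32

# Hypothetical gain value for one branch (illustration only).
G2 <- 1.5

# p-value of the likelihood ratio test: P(chi-squared >= 2 * G2)
p_value <- pchisq(2 * G2, df = 1, lower.tail = FALSE)

# The two rules give the same decision: G2 < C  <=>  p_value > alpha
prune_by_cutoff <- G2 < C95
prune_by_pvalue <- p_value > 0.05

print(c(C95 = C95, C99 = C99, p_value = p_value))
print(c(prune_by_cutoff = prune_by_cutoff, prune_by_pvalue = prune_by_pvalue))
```

With these illustrative numbers the branch is pruned at the 5% level, because its gain is below C95. A smaller alpha gives a larger C, so more branches are pruned and the resulting tree is smaller.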