
Adaboost Algorithm Walkthrough for non-technical


I looked at these questions here and here, but I still couldn't find an answer satisfactory enough for me to understand the algorithm. I understand what it does and the general idea of how it does it, but I still don't feel comfortable enough to use it in projects. I would like to understand how the algorithm corresponds to the more colloquial description of it. I tried reading several online sources, but most of them keep reproducing the examples of Schapire and Freund without really explaining what is going on. Here is the algorithm as given in Schapire's Explaining AdaBoost, p. 2 (AdaBoost algorithm figure):

Here is what I understand from it so far:
(x1, y1), ..., (xm, ym) corresponds to the training set.
When given x1, we observe the output y1.

Each x+number is a member of the set X.
And each y+number is a member of the set {-1,+1}, so for the rest of the algorithm the training set looks like (x1, -1), (x2, +1), ..., (xm, -1 or +1), etc.
Initialize the distribution. I'm not sure about the sense of "distribution" here, but let's carry on.
D1(i) = 1/m for i = 1, ..., m.
I am not sure if there is any relation between the 1 of D1 and the 1 of 1/m, but as far as I understand, the i corresponds to the position of the element in the training set.
I assume t means something like time or instance, so that we have a t=1 instance, a t=2 instance, etc.
By saying "Train weak learner using distribution Dt", I assume we want to do something like Dt(i) = t/m?
Get weak hypothesis ht: X -> {-1,+1}, meaning that ht is such that if it takes an element from the input set X, it gives {-1,+1} as output.
Aim: select ht with low weighted error:
I don't really get what's going on here. The sign "~" usually corresponds to "not" in logical operators, or is sometimes used as an equivalent of circa in dates, but I assume that is not the case here. Does it correspond to something like the probability of i given Dt?
The "εt" at the beginning I thought was entropy at first, but I read here that it is actually error. Is the "[" for indicating a matrix? ht(xi) != yi means that ht produces the wrong output for the input xi, because normally the training set is defined like (x1,y1), ..., (xm,ym) and such.
For the remainder of the algorithm I have no idea. It would be great if someone could explain the rest in non-technical terms. By non-technical, I mean describing what the algorithm does in each step with respect to the previous step. It would be great if you could also explain why functions like "ln" and "sign" are used to describe what is going on, and if you could substitute the variables with something a little more concrete.

PS: I added the notation in code format because SO insisted that my question contain code before accepting it.


Solution

  • Each x+number is a member of the set X. And each y+number is a member of the set {-1,+1}, so for the rest of the algorithm the training set looks like (x1, -1), (x2, +1), ..., (xm, -1 or +1), etc. Initialize the distribution. I'm not sure about the sense of "distribution" here, but let's carry on.

    Seems correct. The sense of distribution is a discrete probability mass function.

    D1(i) = 1/m for i = 1, ..., m. I am not sure if there is any relation between the 1 of D1 and the 1 of 1/m, but as far as I understand, the i corresponds to the position of the element in the training set.

    No relation like that. Yes, that is what i means.
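    A tiny sketch of that initialization step (using NumPy; the value of m is made up for illustration):

```python
import numpy as np

m = 5  # hypothetical number of training examples
D = np.full(m, 1.0 / m)  # D1(i) = 1/m for every example i

# D is a discrete probability mass function: non-negative, sums to 1
print(D)  # [0.2 0.2 0.2 0.2 0.2]
print(D.sum())  # sums to 1 (up to floating-point rounding)
```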

    I assume t means something like time or instance, so that we have a t=1 instance, a t=2 instance, etc.

    Yes, it is the "time" or, more accurately, the iteration number, i.e. the number of weak learners that will be used.

    By saying "Train weak learner using distribution Dt", I assume we want to do something like Dt(i) = t/m?

    No, absolutely not. Think about what AdaBoost does: it combines weak learners into a strong one by iteratively constructing weak learners in such a way that the (say) kth weak learner compensates for the learners before it. The probability distribution is used to weight the data instances for each weak learner, so that the x's on which the current set of weak learners does poorly are considered more strongly by newer weak learners.
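    To make "train weak learner using distribution Dt" concrete, here is a hedged sketch of a one-dimensional decision stump that picks its threshold by minimizing the Dt-weighted error rather than the plain error count (the data and the stump itself are invented for illustration):

```python
import numpy as np

def train_stump(X, y, D):
    """Hypothetical 1-D decision stump: picks the threshold and polarity
    that minimize the D-weighted training error."""
    best = None
    for thr in X:                     # candidate thresholds at the data points
        for polarity in (1, -1):
            pred = np.where(X >= thr, polarity, -polarity)
            err = D[pred != y].sum()  # weighted error, not a plain mistake count
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best  # (weighted error, threshold, polarity)

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1, -1, 1, 1])
D = np.full(4, 0.25)          # uniform weights, as in the first round
err, thr, pol = train_stump(X, y, D)
print(err, thr, pol)  # 0.0 3.0 1 -- this stump separates the toy data perfectly
```

    In practice you either pass Dt as sample weights to the learner (as above) or resample the training set according to Dt.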

    Get weak hypothesis ht: X -> {-1,+1}, meaning that ht is such that if it takes an element from the input set X, it gives {-1,+1} as output. Aim: select ht with low weighted error: I don't really get what's going on here. The sign "~" usually corresponds to "not" in logical operators, or is sometimes used as an equivalent of circa in dates, but I assume that is not the case here. Does it correspond to something like the probability of i given Dt?

    No. Indeed, ~ is "not" in programming languages, but it is not so in mathematics (where it often denotes, e.g., an equivalence relation). In particular, in probability, it means "distributed as".

    The "εt" at the beginning I thought was entropy at first, but I read here that it is actually error. Is the "[" for indicating a matrix? ht(xi) != yi means that ht produces the wrong output for the input xi, because normally the training set is defined like (x1,y1), ..., (xm,ym) and such. For the remainder of the algorithm I have no idea. It would be great if someone could explain the rest in non-technical terms. By non-technical, I mean describing what the algorithm does in each step with respect to the previous step. It would be great if you could also explain why functions like "ln" and "sign" are used, and if you could substitute the variables with something a little more concrete.

    Many misunderstandings here. I'll try to describe the remaining steps (but I do suggest simply reading the original paper ultimately).

    -> Compute the error

    \epsilon_t = Pr_{i~D_t}[ h_t(x_i) != y_i ]
    

    This is literally what it looks like: the probability (over i drawn from D_t) that the t-th weak learner is wrong on a data point x_i. The "[...]" is not a matrix; it just delimits the event whose probability is being taken. I guess it is confusing because it is a weighted probability, so the points with high D_t are more likely to be chosen and therefore contribute more to the probability measure here. So, in essence, we are just estimating the performance of h_t by looking at its error across the dataset (but considering some examples more important than others, specifically those we do poorly at).
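    In code, this weighted error is just the sum of the weights of the misclassified examples (the predictions and weights below are invented for illustration):

```python
import numpy as np

# Hypothetical predictions of the t-th weak learner on 4 examples
y      = np.array([ 1, -1,  1, -1])
h_pred = np.array([ 1,  1,  1, -1])      # wrong only on example index 1
D      = np.array([0.1, 0.4, 0.3, 0.2])  # current weights; they sum to 1

# epsilon_t = Pr_{i ~ D_t}[ h_t(x_i) != y_i ]: sum of the weights of the
# misclassified examples, not the plain fraction of mistakes
epsilon = D[h_pred != y].sum()
print(epsilon)  # 0.4 -- one mistake, but it carries 40% of the mass
```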

    -> Estimate weighting factor

    alpha_t = ln[ (1 - \epsilon_t) / \epsilon_t ] / 2
    

    As mentioned repeatedly, we are trying to make new weak learners that do better on data examples that our previous weak learners (lower t) failed at. This alpha_t is part of the weighting factor. Notice: if the error is 1/2 (i.e. the learner is no better than random guessing), then the weight is zero; this means "pay no attention to useless learners". The smaller the error, the larger the weight. (And if the error is above 1/2, alpha_t is negative, which effectively inverts that learner's vote.)

    Ultimately, when we combine all the weak learners together, we will have to weight them, as some will do better than others and hence should be listened to more strongly. This is what alpha_t measures.

    The reason for the exact form of ln is mathematical; it is fairly easy (see here) to prove that the optimal weight (given some reasonable assumptions, e.g. exponential loss) is indeed of the form given. But that is not so important for you right now.
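    A quick numerical sketch of the formula, evaluated at a few made-up error values, shows the behavior described above:

```python
import numpy as np

def alpha(eps):
    # alpha_t = (1/2) * ln((1 - eps) / eps)
    return 0.5 * np.log((1 - eps) / eps)

print(alpha(0.1))  # approx  1.0986: accurate learner, large positive weight
print(alpha(0.5))  # 0.0: no better than coin flipping, ignored entirely
print(alpha(0.9))  # approx -1.0986: worse than random, its vote is inverted
```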

    -> Update the probability distribution (i.e. data point weighting function):

    D_{t+1}(i) = [D_t(i) exp(-alpha_t y_i h_t(x_i))] / Z_t
    

    Again, we want to weight "hard" examples more heavily. So, look at the form of this expression: we are updating the probability mass of x_i. It depends on

    • D_t(i): the previous probability value of x_i (if it was important before, it should be so now too)
    • alpha_t: the weighting factor for the weak learner (a higher-weight learner means the change in the probability mass for this x_i will be stronger)
    • y_i h_t(x_i): note that if the learner is wrong, then y_i h_t(x_i) = -1; otherwise y_i h_t(x_i) = 1. In the former case, -alpha_t y_i h_t(x_i) will be positive (so exp(-alpha_t y_i h_t(x_i)) will be big); in the latter, negative (so exp(-alpha_t y_i h_t(x_i)) will be small). So, all this means is that we upweight instances on which we are wrong and downweight instances on which we are right. This makes sense: if h_t is correct on x_i, then h_{t+1} should not care as much about it; it should focus on cases where h_t was wrong, and try to make up for it!
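    The whole update can be sketched in a few lines (numbers invented for illustration). Note how the single misclassified example ends up carrying half of the total mass afterwards, which is a known property of the AdaBoost update:

```python
import numpy as np

y      = np.array([ 1, -1,  1, -1])
h_pred = np.array([ 1,  1,  1, -1])       # wrong only on example index 1
D      = np.array([0.25, 0.25, 0.25, 0.25])
a      = 0.5 * np.log((1 - 0.25) / 0.25)  # alpha_t for eps = 0.25

# D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
D_new = D * np.exp(-a * y * h_pred)
D_new /= D_new.sum()  # Z_t is just the normalizer that keeps D a distribution
print(D_new)  # the misclassified example's weight rises, the others fall
```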

    -> Combine the weak learners

    The final learner is just a weighted combination of the weak learners (weighted by alpha_t), with sign applied to turn the real-valued weighted vote into a -1/+1 prediction; that is why sign appears. And that's that. Let me know if that made sense.
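    Finally, the combination step, with made-up votes and weights for three weak learners, shows what sign is doing: it converts the real-valued weighted vote into a hard class label:

```python
import numpy as np

# Hypothetical votes of three weak learners on a single example x
votes  = np.array([ 1, -1,  1])     # h_1(x), h_2(x), h_3(x)
alphas = np.array([0.8, 0.2, 0.5])  # their alpha_t weights

# H(x) = sign( sum_t alpha_t * h_t(x) )
score = np.dot(alphas, votes)
H = np.sign(score)
print(score, H)  # positive score, so the combined prediction is +1
```

    The two higher-weight learners outvote the low-weight dissenter here, which is exactly the "listen to better learners more strongly" idea from above.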

    //(Good God, I wish SO had Latex...)