deep-learning, deep-residual-networks

Intuition on Deep Residual Network


I was reading the Deep Residual Network paper, and there is a concept in it that I cannot fully understand:

[Figure: the residual building block from the paper. The input x passes through two weight layers to produce F(x), which is added to the identity shortcut x, giving H(x) = F(x) + x.]

Question:

  1. What is meant by "hope the 2 weight layers fit F(x)"?

  2. Here F(x) processes x with two weight layers (plus a ReLU non-linearity), so is the desired mapping H(x) = F(x)? Where is the residual?


Solution

  • What is meant by "hope the 2 weight layers fit F(x)"?

    The residual unit shown obtains F(x) by processing x with two weight layers. It then adds x to F(x) to obtain H(x). Now, assume that H(x) is the ideal predicted output, the one that matches your ground truth. Since H(x) = F(x) + x, obtaining the desired H(x) comes down to getting the right F(x). In other words, if the two weight layers in the residual unit can produce the desired F(x), then obtaining the ideal H(x) is guaranteed.
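
    As a tiny numeric sketch of this (the numbers and the use of numpy are purely illustrative, not from the paper): given the input x and the ideal output H(x), the quantity the two weight layers have to fit is just their difference.

    import numpy as np

    # Hypothetical numbers, only to illustrate the idea.
    x        = np.array([1.0, 2.0, 3.0])   # input to the residual unit
    H_target = np.array([1.5, 1.8, 3.2])   # ideal output the unit should produce

    # Because the unit computes H(x) = F(x) + x, the two weight layers only
    # have to fit the residual, i.e. the gap between the target and the input:
    F_target = H_target - x
    print(F_target)                         # [ 0.5 -0.2  0.2]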

    Here F(x) processes x with two weight layers (plus a ReLU non-linearity), so is the desired mapping H(x) = F(x)? Where is the residual?

    The first part is correct: F(x) is obtained from x as follows.

    x -> weight_1 -> ReLU -> weight_2
    

    H(x) is obtained from F(x) as follows.

    F(x) + x -> ReLU 
    

    So the second part of your question does not hold: the desired mapping is H(x) = F(x) + x, not H(x) = F(x). The residual is F(x), i.e. F(x) = H(x) - x.
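
    As a rough sketch of the whole unit (not the paper's actual architecture, which uses convolutional layers and batch normalization; this uses small fully-connected layers with made-up sizes in numpy):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    # A minimal fully-connected version of the residual unit described above.
    # The feature size d and the random weights W1, W2 are assumptions for
    # illustration only.
    d = 4
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(d, d))
    W2 = rng.normal(scale=0.1, size=(d, d))

    def residual_unit(x):
        F = W2 @ relu(W1 @ x)   # F(x):  x -> weight_1 -> ReLU -> weight_2
        return relu(F + x)      # H(x):  add the identity shortcut, then ReLU

    x = rng.normal(size=d)
    print(residual_unit(x))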

    The authors hypothesize that the residual mapping (i.e. F(x)) may be easier to optimize than H(x). To illustrate with a simple example, assume that the ideal mapping is H(x) = x. Then a plain (non-residual) network would have to learn this identity mapping directly through a stack of non-linear layers, as follows.

    x -> weight_1 -> ReLU -> weight_2 -> ReLU -> ... -> x
    

    Approximating the identity mapping with all these weights and ReLUs in the middle is difficult.

    Now, if we define the desired mapping as H(x) = F(x) + x, then we just need to get F(x) = 0, as follows.

    x -> weight_1 -> ReLU -> weight_2 -> ReLU -> ... -> 0  # look at the last 0
    

    Achieving the above is easy: just drive the weights of the last weight layer to zero and F(x) becomes zero. Add back x and you get the desired identity mapping.
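
    A quick way to see this, continuing the simplified fully-connected sketch from above (again with made-up sizes, not the paper's convolutional blocks):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    d = 4
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(d, d))
    W2 = np.zeros((d, d))            # drive the second weight layer to zero

    x = relu(rng.normal(size=d))     # non-negative, as after a preceding ReLU

    F = W2 @ relu(W1 @ x)            # F(x) is exactly zero
    H = relu(F + x)                  # so H(x) = ReLU(x) = x for non-negative x
    print(np.allclose(H, x))         # True: the unit realises the identity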

    Another factor in the success of residual networks is the uninterrupted gradient flow from the last layer back to the first. That is out of scope for your question; you can read the paper "Identity Mappings in Deep Residual Networks" for more information on this.
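
    For a flavour of what that means, here is a scalar toy example (purely illustrative, not taken from either paper): because H(x) = F(x) + x, the derivative through the block is F'(x) + 1, so the shortcut keeps the gradient from shrinking even when F is nearly flat.

    import numpy as np

    def F(x):
        return 1e-6 * np.tanh(x)     # a hypothetical, almost-flat residual branch

    def H(x):
        return F(x) + x              # residual block: shortcut added to F(x)

    # Finite-difference estimate of the derivatives at an arbitrary point.
    x, eps = 0.7, 1e-5
    dF = (F(x + eps) - F(x - eps)) / (2 * eps)
    dH = (H(x + eps) - H(x - eps)) / (2 * eps)
    print(dF, dH)   # dF is ~0, but dH is ~1 because of the "+ x" shortcut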