python-3.x, mxnet

MXNet (python3): defining a residual convolution structure as a Block from the Gluon module


NOTE:

I am new to MXNet.

It seems that the Gluon module is meant to replace(?) the Symbol module as the high-level neural network (nn) interface. So this question specifically seeks an answer utilizing the Gluon module.

Context

Residual neural networks (res-NNs) are a fairly popular architecture (the link provides a review of res-NNs). In brief, a res-NN is an architecture where the input undergoes a (series of) transformation(s) (e.g. through a standard nn layer) and, at the end, is combined with its unadulterated self prior to an activation function:

(figure: res-NN residual block diagram)
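In pseudo-code the pattern is simply out = activation(transform(x) + x). A minimal NDArray sketch of that idea (the name transform is purely illustrative, and the import anticipates the ones below):

import mxnet as mx

def residual_pattern(x, transform):
    # transform: any function / Block that maps x to an output of the same shape
    return mx.nd.relu(transform(x) + x)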

So the main question here is "How to implement a res-NN structure with a custom gluon.Block?" What follows is:

  1. my attempt at doing this (which is incomplete and probably has errors)
  2. sub-questions, highlighted as block quotes.

Normally sub-questions are seen as concurrent main questions, resulting in a post being flagged as too general. In this case they are legitimate sub-questions, as my inability to solve the main question stems from them, and the partial / first-draft documentation of the gluon module is insufficient to answer them.

Main Question

"How to implement a res-NN structure with a custom gluon.Block?"

First, let's do some imports:

import mxnet as mx
import numpy as np
import math
import random
gpu_device=mx.gpu()
ctx = gpu_device

Prior to defining our res-NN structure, we first define a common convolutional NN (cnn) chunk; namely, convolution → batch norm → ramp (ReLU).

class CNN1D(mx.gluon.Block):
    def __init__(self, channels, kernel, stride=1, padding=0, **kwargs):
        super(CNN1D, self).__init__(**kwargs) 
        with self.name_scope():
            self.conv = mx.gluon.nn.Conv1D(channels=channels, kernel_size=kernel, strides=stride, padding=padding)
            self.bn = mx.gluon.nn.BatchNorm()
            self.ramp = mx.gluon.nn.Activation(activation='relu')

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.ramp(x)
        return x
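As a quick smoke test of my own (not from the docs): instantiate the block, initialize its parameters, and push a dummy batch through; with stride 1 and padding = kernel // 2 the sequence length is preserved:

net = CNN1D(channels=16, kernel=3, stride=1, padding=1)
net.initialize(ctx=ctx)                                  # swap ctx for mx.cpu() if no GPU
x = mx.nd.random.uniform(shape=(8, 4, 100), ctx=ctx)     # (batch, channels, width)
print(net(x).shape)                                      # (8, 16, 100)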

Subquestion: mx.gluon.nn.Activation vs the NDArray module's nd.relu: when should one use which, and why? In all the MXNet tutorials / demos I saw in the documentation, custom gluon.Blocks use nd.relu(x) in the forward function.
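For what it's worth, the two spellings appear interchangeable on plain data; a hedged check of my own:

x = mx.nd.array([[-1.0, 0.0, 2.0]])
ramp = mx.gluon.nn.Activation('relu')   # layer-style, has no parameters to initialize
print(ramp(x))                          # [[0. 0. 2.]]
print(mx.nd.relu(x))                    # [[0. 0. 2.]], the functional spelling from the tutorials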

Subquestion: self.ramp(self.conv(x)) vs mx.gluon.nn.Conv1D(activation='relu')(x)? i.e. what is the consequence of adding the activation argument to a layer? Does that mean the activation is automatically applied in the forward function when that layer is called?
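A sketch of the two formulations this contrasts, assuming the activation argument folds the ReLU into the layer's own forward (the answer below confirms this); sizes are illustrative:

fused = mx.gluon.nn.Conv1D(channels=8, kernel_size=3, activation='relu')
separate = mx.gluon.nn.Sequential()
separate.add(mx.gluon.nn.Conv1D(channels=8, kernel_size=3))
separate.add(mx.gluon.nn.Activation('relu'))
# with identical weights, fused(x) and separate(x) should produce the same output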

Now that we have a re-usable cnn chunk, let's define a res-NN where:

  1. there are chain_length cnn chunks
  2. the first cnn chunk uses a different stride than all the subsequent ones

So here is my attempt:

class RES_CNN1D(mx.gluon.Block):
    def __init__(self, channels, kernel, initial_stride, chain_length=1, stride=1, padding=0, **kwargs):
        super(RES_CNN1D, self).__init__(**kwargs)
        with self.name_scope():
            num_rest = chain_length - 1
            self.ramp = mx.gluon.nn.Activation(activation='relu')
            self.init_cnn = CNN1D(channels, kernel, initial_stride, padding)
            # I am guessing this is how to correctly add an arbitrary number of chunks
            self.rest_cnn = mx.gluon.nn.Sequential()
            for i in range(num_rest):
                self.rest_cnn.add(CNN1D(channels, kernel, stride, padding))


    def forward(self, x):
        # make a copy of the untouched input to send through the chunks
        y = x.copy()
        y = self.init_cnn(y)
        # I am guessing that if I call an mx.gluon.nn.Sequential object, all nets inside are called / the input gets passed along all of them?
        y = self.rest_cnn(y)
        y += x
        y = self.ramp(y)
        return y
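To check that this at least runs, here is a smoke test of my own, with channels equal to the input channels and initial_stride=1 so that y += x is shape-compatible (with initial_stride=2 the residual addition would fail unless the shortcut is downsampled as well):

net = RES_CNN1D(channels=16, kernel=3, initial_stride=1, chain_length=3, padding=1)
net.initialize(ctx=ctx)
x = mx.nd.random.uniform(shape=(8, 16, 100), ctx=ctx)   # (batch, channels, width)
print(net(x).shape)                                     # (8, 16, 100)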

Subquestion: when adding a variable number of layers, should one use the hacky eval("self.layer" + str(i) + " = mx.gluon.nn.Conv1D()"), or is this what mx.gluon.nn.Sequential is meant for?
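Reusing net from the smoke test above, a hedged way to see that the looped-in chunks are registered is to inspect the collected parameters:

print(net.collect_params())   # lists the conv / batchnorm parameters of all three chunks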

Subquestion: when defining the forward function in a custom gluon.Block which has an instance of mx.gluon.nn.Sequential (let us refer to it as self.seq), does self.seq(x) just pass the argument x down the line? e.g. if this is self.seq

self.seq = mx.gluon.nn.Sequential()
self.conv1 = mx.gluon.nn.Conv1D()
self.conv2 = mx.gluon.nn.Conv1D()
self.seq.add(self.conv1)
self.seq.add(self.conv2)

is self.seq(x) equivalent to self.conv2(self.conv1(x))?

Is this correct?
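One way to convince yourself, a check of my own rather than anything from the docs, is to build exactly that Sequential from shared child blocks and compare the outputs:

conv1 = mx.gluon.nn.Conv1D(channels=8, kernel_size=3)
conv2 = mx.gluon.nn.Conv1D(channels=8, kernel_size=3)
seq = mx.gluon.nn.Sequential()
seq.add(conv1)
seq.add(conv2)
seq.initialize()
x = mx.nd.random.uniform(shape=(2, 4, 32))
print((seq(x) - conv2(conv1(x))).abs().sum())   # 0.0 -- the same computation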

The desired result for

RES_CNN1D(10, 3, 2, chain_length=3)

should look like this

Conv1D(10, 3, stride=2)  -----
BatchNorm                    |
Ramp                         |
Conv1D(10, 3)                |
BatchNorm                    |
Ramp                         |
Conv1D(10, 3)                |
BatchNorm                    |
Ramp                         |
  |                          |
 (+)<-------------------------
  v
Ramp
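One way I plan to sanity-check the structure against this diagram is simply to print the block; Gluon Blocks print a summary of their child blocks (the residual connection itself lives in forward, so it will not appear in the printout):

print(RES_CNN1D(10, 3, 2, chain_length=3))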

Solution

    1. self.ramp(self.conv(x)) vs mx.gluon.nn.Conv1D(activation='relu')(x): yes, the latter applies a relu activation to the output of Conv1D.

    2. mx.gluon.nn.Sequential is for grouping multiple layers into one block. Usually you don't need to explicitly define each layer as a class attribute: you can create a list that stores all the layers you want to group, and use a for loop to add every list element to an mx.gluon.nn.Sequential object.
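A brief sketch of that approach, with illustrative sizes, using the CNN1D block defined in the question:

layers = [CNN1D(16, 3, 1, 1) for _ in range(3)]   # build the chunks in a plain list
chain = mx.gluon.nn.Sequential()
for layer in layers:
    chain.add(layer)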

    3. Yes. Calling forward on mx.gluon.nn.Sequential is equivalent to calling forward on all of its child blocks, in the topological order of the computation graph.