deep-learning, computer-vision, conv-neural-network, artificial-intelligence

Why do CNNs usually have a stem?


Most cutting-edge/famous CNN architectures have a stem that does not use the same block as the rest of the network. Instead, most architectures use plain Conv2d or pooling layers in the stem, without special modules/layers such as a shortcut (residual), an inverted residual, a ghost conv, and so on.
Why is this? Are there experiments/theories/papers/intuitions behind this?

Examples of stems:
Classic ResNet: Conv2d + MaxPool:
[figure 1: ResNet stem configuration]

Bag of Tricks ResNet-C: 3 × Conv2d + MaxPool.
Even though two of those Conv2d layers could form exactly the same structure as a classic residual block, as shown below in [figure 2], there is no shortcut in the stem:
[figure 2: ResNet-C stem; classic residual block]
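To make the two stems concrete, here is a minimal PyTorch sketch. It is my own reconstruction from the papers, not the authors' code; the channel counts follow the usual choices (64 for the classic stem, 32/32/64 for ResNet-C), and both stems downsample 224x224 inputs to 56x56 without any shortcut:

```python
import torch
import torch.nn as nn

# Classic ResNet stem: one strided 7x7 conv followed by max pooling (4x downsampling total).
classic_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# ResNet-C stem (Bag of Tricks): three 3x3 convs replace the 7x7, still no shortcut.
resnet_c_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(classic_stem(x).shape)   # torch.Size([1, 64, 56, 56])
print(resnet_c_stem(x).shape)  # torch.Size([1, 64, 56, 56])
```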

Many other architectures show the same pattern, such as EfficientNet, MobileNet, GhostNet, SE-Net, and so on.

References:
https://arxiv.org/abs/1812.01187 (Bag of Tricks for Image Classification with Convolutional Neural Networks)
https://arxiv.org/abs/1512.03385 (Deep Residual Learning for Image Recognition)


Solution

  • As far as I know, this is done to quickly downsample the input image with strided convolutions of fairly large kernel size (5x5 or 7x7), so that the later layers can do their work at a much lower computational cost.
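To see roughly how much this buys, here is a back-of-the-envelope sketch assuming a 224x224 input and the classic stem's 4x-per-dimension downsampling; the numbers are illustrative, not taken from the papers:

```python
# Rough multiply-add count of a single 3x3 convolution: H * W * C_in * C_out * k * k.
def conv3x3_macs(h, w, c_in, c_out, k=3):
    return h * w * c_in * c_out * k * k

full_res = conv3x3_macs(224, 224, 64, 64)   # a 64-channel 3x3 conv at full input resolution
after_stem = conv3x3_macs(56, 56, 64, 64)   # the same conv after the stem's 4x downsampling
print(full_res / after_stem)  # 16.0 -> the same layer is ~16x cheaper after the stem
```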