In most architectures, conv layers are followed by a pooling layer (max, avg, etc.). Since those pooling layers just select from the output of the previous layer (i.e. conv), can we simply use a convolution with stride 2 instead and expect similar accuracy with less computation?
Yes, that can be done. It's explained in the paper 'Striving for Simplicity: The All Convolutional Net':
https://arxiv.org/pdf/1412.6806.pdf. Quote from the paper:
'We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks'
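
To make the "reduced computation" part concrete, here is a rough back-of-the-envelope sketch in plain Python. It compares a 3x3 conv with stride 1 followed by 2x2 pooling against a single 3x3 conv with stride 2. The input size (32x32), channel counts (96 in, 96 out), and padding of 1 are assumptions picked for illustration, not values taken from the question, and the helper names are hypothetical.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output size along one axis for a conv or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_macs(n, c_in, c_out, kernel, stride, padding):
    """Multiply-accumulate count for a square conv layer on an n x n input."""
    out = conv_out_size(n, kernel, stride, padding)
    return out * out * c_out * c_in * kernel * kernel


# Assumed example dimensions (illustrative only).
n, c_in, c_out = 32, 96, 96

# Option A: 3x3 conv, stride 1, padding 1, then 2x2 pooling with stride 2.
conv_s1 = conv_out_size(n, kernel=3, stride=1, padding=1)
pooled = conv_out_size(conv_s1, kernel=2, stride=2, padding=0)

# Option B: a single 3x3 conv with stride 2, padding 1.
strided = conv_out_size(n, kernel=3, stride=2, padding=1)

# Only the conv cost is counted; pooling itself adds no multiply-accumulates.
macs_conv_pool = conv_macs(n, c_in, c_out, kernel=3, stride=1, padding=1)
macs_strided = conv_macs(n, c_in, c_out, kernel=3, stride=2, padding=1)

print("output size after conv + pool:", pooled)
print("output size after strided conv:", strided)
print("conv MAC ratio (conv + pool vs strided):", macs_conv_pool / macs_strided)
```

Under these assumptions both options produce the same spatial output size, while the stride-2 conv evaluates the kernel at a quarter of the positions. Whether accuracy holds up is an empirical question for your dataset; the quote above reports that it did on the benchmarks the authors tested.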