deep-learning pytorch conv-neural-network resnet encoder-decoder

Encoder-decoder neural network architecture with different input and output sizes

I am trying to figure out a good architecture for a neural network that takes projections (2D images) acquired from different angles and reconstructs a volume made up of 2D slices (CT-like).

So for example:

Input [180,100,100] -> 180 projections, each a 100x100-pixel image.
Output [100,100,100] -> Volume of size 100x100x100 (100 slices, each a 100x100 2D image).

I have ground truth volumes.

I came up with the idea of using a ResNet as the encoder, but I'm not sure how to implement the decoder or which model would be a good choice for this kind of problem. I considered a U-Net architecture, but since the output dimensions differ from the input dimensions, I abandoned that idea.

I am using PyTorch.

Solution

Specifying the whole network is out of scope of a single answer, but generally you want something like this:

1. Use a ResNet or vision transformer as the encoder
2. Use the encoder to map the input down to a latent tensor
3. Reshape the latent tensor as needed
4. Use ConvTranspose3d layers to upsample the latent tensor to the desired output size
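The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned architecture: the class name `ProjectionsToVolume`, the channel counts, and the small strided-convolution encoder (standing in for a real ResNet) are all assumptions made for the example. It also assumes the 180 projections are treated as input channels and that the latent channels are split into (channels, depth) before the 3D decoder.

```python
import torch
import torch.nn as nn


class ProjectionsToVolume(nn.Module):
    """Hypothetical sketch: 2D encoder -> reshape to 3D latent -> 3D decoder."""

    def __init__(self, n_projections=180, latent_channels=8, latent_depth=25):
        super().__init__()
        self.latent_channels = latent_channels
        self.latent_depth = latent_depth
        # Stand-in encoder (assumption): a real ResNet would go here.
        # The projections are treated as input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(n_projections, 128, kernel_size=3, stride=2, padding=1),  # 100 -> 50
            nn.ReLU(inplace=True),
            nn.Conv2d(128, latent_channels * latent_depth,
                      kernel_size=3, stride=2, padding=1),                      # 50 -> 25
            nn.ReLU(inplace=True),
        )
        # Each ConvTranspose3d with kernel_size=2, stride=2 doubles D, H and W.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 4, kernel_size=2, stride=2),    # 25 -> 50
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(4, 1, kernel_size=2, stride=2),                  # 50 -> 100
        )

    def forward(self, x):
        z = self.encoder(x)                      # [B, C*D, 25, 25]
        b, _, h, w = z.shape
        # Split the channel axis into (channels, depth) to get a 3D latent.
        z = z.reshape(b, self.latent_channels, self.latent_depth, h, w)
        out = self.decoder(z)                    # [B, 1, 100, 100, 100]
        return out.squeeze(1)                    # [B, 100, 100, 100]


model = ProjectionsToVolume()
projections = torch.randn(1, 180, 100, 100)      # one sample of 180 projections
with torch.no_grad():
    volume = model(projections)
print(tuple(volume.shape))  # (1, 100, 100, 100)
```

Splitting the channel axis into a depth axis is just one way to do the reshape step; flattening to a vector and using a linear layer before reshaping is another option, at the cost of many more parameters.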

You can also do a UNet-like setup with skip connections between encoder layers and decoder layers. You would just need a projection layer to map the 2D encoder activations into a shape compatible with the 3D decoder activations.
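One possible form of such a projection layer is sketched below. The module name `SkipProjection` and its parameters are hypothetical, and the choices made here (a 1x1 convolution to expand the channels, a reshape into a depth axis, trilinear interpolation when the sizes still differ, and concatenation rather than addition) are assumptions rather than the only way to do it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipProjection(nn.Module):
    """Hypothetical sketch: map a 2D encoder feature map onto a 3D decoder feature map."""

    def __init__(self, enc_channels, dec_channels, dec_depth):
        super().__init__()
        self.dec_channels = dec_channels
        self.dec_depth = dec_depth
        # A 1x1 conv produces dec_channels * dec_depth channels, which are
        # then split into separate channel and depth axes.
        self.proj = nn.Conv2d(enc_channels, dec_channels * dec_depth, kernel_size=1)

    def forward(self, enc_feat, dec_feat):
        b, _, h, w = enc_feat.shape
        skip = self.proj(enc_feat).reshape(b, self.dec_channels, self.dec_depth, h, w)
        # Resample if the projected map does not match the decoder's D, H, W.
        target = tuple(dec_feat.shape[2:])
        if tuple(skip.shape[2:]) != target:
            skip = F.interpolate(skip, size=target, mode="trilinear", align_corners=False)
        # Concatenate along channels, as in a standard U-Net skip connection.
        return torch.cat([dec_feat, skip], dim=1)
```

The decoder layer that consumes the result has to expect the extra channels (here the decoder channels plus `dec_channels`). Adding the projected tensor to `dec_feat` instead of concatenating would keep the channel count unchanged.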