Tags: deep-learning, pytorch, conv-neural-network, resnet, encoder-decoder

Encoder-decoder neural network architecture with different input and output sizes


I am trying to figure out a good architecture for a neural network that takes projections (2D images) acquired from different angles and reconstructs a volume made up of 2D slices (CT-like).

So for example:

  • Input [180,100,100] -> 180 projections, each a 100x100-pixel image.
  • Output [100,100,100] -> a volume of size 100x100x100 (100 2D slices of 100x100 pixels).

I have ground truth volumes.

I came up with the idea of using a ResNet as the encoder, but I'm not sure how to implement the decoder or which model would be a good choice for this kind of problem. I considered a U-Net architecture, but my output dimensions differ from the input dimensions, so I abandoned that idea.

I am using PyTorch.


Solution

  • Specifying the whole network is beyond the scope of a single answer, but in general you want something like this:

    1. Use a ResNet or a vision transformer as the encoder
    2. Use the encoder to map the input down to a latent tensor
    3. Reshape the latent tensor into the 5D layout [batch, channels, depth, height, width] that 3D layers expect
    4. Use ConvTranspose3d layers to upsample the latent tensor to the desired output size
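
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned architecture: `ResBlock2d` and `ProjectionsToVolume` are hypothetical names, the two residual blocks are only a small stand-in for a full ResNet, and the choices of treating the 180 projections as input channels, using 8 3D channels, and using a latent depth of 25 are all assumptions made so that two doubling steps land exactly on 100.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock2d(nn.Module):
    """Basic 2D residual block; a small stand-in for a full ResNet stage."""

    def __init__(self, c_in, c_out, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # 1x1 conv so the shortcut matches the body's channels and stride.
        self.shortcut = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        return F.relu(self.body(x) + self.shortcut(x))


class ProjectionsToVolume(nn.Module):
    """Hypothetical model: [B, 180, 100, 100] -> [B, 100, 100, 100]."""

    def __init__(self, n_proj=180, c3d=8, latent_depth=25):
        super().__init__()
        self.c3d, self.latent_depth = c3d, latent_depth
        # Steps 1-2: 2D encoder; projections are treated as input channels.
        self.encoder = nn.Sequential(
            ResBlock2d(n_proj, 128, stride=2),              # 100 -> 50
            ResBlock2d(128, c3d * latent_depth, stride=2),  # 50 -> 25
        )
        # Step 4: each ConvTranspose3d (k=4, s=2, p=1) doubles D, H and W.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(c3d, c3d, 4, stride=2, padding=1),  # 25 -> 50
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(c3d, 1, 4, stride=2, padding=1),    # 50 -> 100
        )

    def forward(self, x):
        z = self.encoder(x)                                  # [B, 200, 25, 25]
        b, _, h, w = z.shape
        # Step 3: split the channel axis into (3D channels, depth).
        z = z.reshape(b, self.c3d, self.latent_depth, h, w)  # [B, 8, 25, 25, 25]
        return self.decoder(z).squeeze(1)                    # [B, 100, 100, 100]


model = ProjectionsToVolume().eval()
with torch.no_grad():
    volume = model(torch.randn(1, 180, 100, 100))
print(volume.shape)  # torch.Size([1, 100, 100, 100])
```

With kernel size 4, stride 2 and padding 1, a transposed convolution maps a length n to (n - 1) * 2 - 2 + 4 = 2n, which is why a 25x25x25 latent reaches 100x100x100 in two layers. For output sizes that are not reachable by doubling, one option is to upsample slightly past the target and crop, or to finish with an interpolation to the exact size.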

    You can also use a U-Net-like setup with skip connections between encoder layers and decoder layers; you would just need a projection layer to map the encoder activations into a shape compatible with the decoder activations.
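
One possible form for such a projection layer is sketched below. It assumes the encoder activation is 2D ([B, C, H, W]), the decoder activation is 3D ([B, C, D, H, W]), and both share the same H and W; the name `SkipProjection` and the channel counts are hypothetical. A 1x1 convolution produces `c3d * depth` channels, which are then split into a channel axis and a depth axis before concatenation.

```python
import torch
import torch.nn as nn


class SkipProjection(nn.Module):
    """Hypothetical projection: 2D encoder features -> 3D skip tensor."""

    def __init__(self, c2d, c3d, depth):
        super().__init__()
        self.c3d, self.depth = c3d, depth
        # 1x1 conv produces c3d * depth channels, later split into (c3d, depth).
        self.proj = nn.Conv2d(c2d, c3d * depth, kernel_size=1)

    def forward(self, enc_feat, dec_feat):
        b, _, h, w = enc_feat.shape
        skip = self.proj(enc_feat).reshape(b, self.c3d, self.depth, h, w)
        # Concatenate along the channel axis, as in a U-Net.
        return torch.cat([dec_feat, skip], dim=1)


enc_feat = torch.randn(2, 128, 50, 50)    # encoder activation [B, C, H, W]
dec_feat = torch.randn(2, 8, 50, 50, 50)  # decoder activation [B, C, D, H, W]
fused = SkipProjection(c2d=128, c3d=8, depth=50)(enc_feat, dec_feat)
print(fused.shape)  # torch.Size([2, 16, 50, 50, 50])
```

Because concatenation doubles the channel count here, the decoder layer that consumes `fused` would need 16 input channels. Adding the skip tensor to the decoder activation instead of concatenating would keep the channel count unchanged.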