I have a dataset of three images. When I train an autoencoder on those three images, the output I get is exactly the same for each image, and it looks like a blend of all three.
My result looks like this:
So you can see that the output is exactly the same for each of the inputs, and while it matches each image relatively well, it isn't a perfect reconstruction.
This is a three-image dataset - the reconstructions should be perfect (or at least different for each of the images).
This three-image case concerns me because when I train on my 500-image dataset, all I get back is a blank white image, since that's presumably the best average of all those images.
I'm using Keras, and the code is really simple.
from keras.models import Sequential
from keras.layers import Dense, Flatten, Reshape
import numpy as np
# returns a numpy array with shape (3, 24, 32, 1)
# there are 3 images that are each 24x32 and are black and white (1 color channel)
x_train = get_data()
# this is the size of our encoded representations
# encode down to two numbers (I have tested using 3; I still have the same issue)
encoding_dim = 2
# the shape without the batch amount
input_shape = x_train.shape[1:]
# how many output neurons we need to create an image
input_dim = np.prod(input_shape)
# simple feedforward network
# I've also tried convolutional layers; same issue
autoencoder = Sequential([
    Flatten(input_shape=input_shape),  # flatten the image into a vector
    Dense(encoding_dim),               # encode
    Dense(input_dim),                  # decode
    Reshape(input_shape)               # reshape the decoding back into an image
])
# adadelta optimizer works better than adam, same issue with both
autoencoder.compile(optimizer='adadelta', loss='mse')
# train it to output the same thing it gets as input
# I've tried epochs up to 30000 with no improvement;
# still predicts the same image for all three inputs
autoencoder.fit(x_train, x_train,
                epochs=10,
                batch_size=1,
                verbose=1)
out = autoencoder.predict(x_train)
I then take the outputs (out[0], out[1], out[2]) and convert them back into images. You can see the output images above.
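For reference, converting the outputs back into images is just rescaling the arrays and saving them; roughly something like this (using PIL here, though any image library works, and assuming the pixel values are in the 0-1 range):
from PIL import Image
import numpy as np
for i, img in enumerate(out):
    # drop the channel axis, scale back up to 0-255, and save as a grayscale PNG
    arr = (np.squeeze(img) * 255).clip(0, 255).astype(np.uint8)
    Image.fromarray(arr, mode='L').save(f'output_{i}.png')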
I'm worried because this shows that the autoencoder isn't preserving any information that distinguishes the input images, which is not how an autoencoder should behave.
How can I get the autoencoder to produce different outputs for different input images?
EDIT:
One of my coworkers suggested not using an autoencoder at all, but a one-layer feedforward neural network instead. I tried this, and the same thing happened until I set the batch size to 1 and trained for 1,400 epochs; then it worked perfectly. This leads me to think that more epochs would solve this issue, but I'm not sure yet.
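For reference, the non-autoencoder version was roughly this (a sketch from memory, reusing input_shape and input_dim from above):
# one-layer feedforward network: flatten, a single Dense layer back to image size, reshape
ff = Sequential([
    Flatten(input_shape=input_shape),
    Dense(input_dim),
    Reshape(input_shape)
])
ff.compile(optimizer='adadelta', loss='mse')
# batch size 1 and ~1400 epochs is what finally made it reconstruct all three images
ff.fit(x_train, x_train, epochs=1400, batch_size=1, verbose=1)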
EDIT:
Training for 10,000 epochs (with batch size 3) made the second image look different from the first and third in the autoencoder's output, which is exactly what happened with the non-autoencoder version after around 400 epochs (also with batch size 3). This is further evidence that training for more epochs may be the solution.
I'm going to test with batch size 1 to see if that helps even more, and then try training for many more epochs to see if that completely solves the issue.
My encoding dimension was way too small. Trying to encode 24x32 images into 2 numbers (or 3 numbers) is just too much for the autoencoder to handle.
By raising encoding_dim to 32, the issue was pretty much solved. I was able to use the default learning rate with the Adadelta optimizer. My data didn't even need anything fancy for normalization (just dividing all of the pixel values by 255 worked).
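In code, those changes are just (assuming the raw pixel values are 0-255):
# scale pixel values into the 0-1 range and use a much larger bottleneck
x_train = x_train.astype('float32') / 255.0
encoding_dim = 32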
The "binary_crossentropy"
loss function seemed to work a bit faster/better than "mse"
, although "mse"
(mean-squared-error) worked just fine.
In the first few hundred epochs, it does look like it's blending the images; as it trains for longer, though, the outputs start to separate.
I also made the activation of the encode layer relu and the activation of the decode layer sigmoid. I'm not sure how much of an effect that had on the output - I haven't tested it.
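Putting all of that together, the working version looks roughly like this (the epoch count and batch size here are illustrative - I just trained until the outputs separated):
autoencoder = Sequential([
    Flatten(input_shape=input_shape),         # flatten (24, 32, 1) -> 768
    Dense(encoding_dim, activation='relu'),   # encode down to 32 numbers
    Dense(input_dim, activation='sigmoid'),   # decode back to 768 values in 0-1
    Reshape(input_shape)                      # reshape back into an image
])
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train, epochs=1000, batch_size=1, verbose=1)
out = autoencoder.predict(x_train)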
This page helped a ton in understanding what I did wrong. I just copy/pasted the code, found that it worked on my dataset, and the rest was figuring out where my version differed.
Here are some images of their simple autoencoder architecture working on my dataset (which was my first sign of hope):