Tags: python, machine-learning, neural-network, keras, autoencoder

Keras autoencoder


I worked with neural networks in Java a long time ago, and now I'm trying to learn to use TFLearn and Keras in Python.

I'm trying to build an autoencoder, but since I'm running into problems, the code I show here doesn't have a bottleneck (which should make the problem even easier).

In the following code I create the network and the dataset (two random variables), and after training it plots the correlation between each predicted variable and its input.

What the network should learn is to output the same input it receives.

import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model


def buildMyNetwork(inputs, bottleNeck):
    # Symmetric encoder/decoder; with inputs=2 and bottleNeck=2 there is
    # no real bottleneck, so the network only has to learn the identity.
    inputLayer = Input(shape=(inputs,))
    autoencoder = Dense(inputs * 2, activation='relu')(inputLayer)
    autoencoder = Dense(inputs * 2, activation='relu')(autoencoder)
    autoencoder = Dense(bottleNeck, activation='relu')(autoencoder)
    autoencoder = Dense(inputs * 2, activation='relu')(autoencoder)
    autoencoder = Dense(inputs * 2, activation='relu')(autoencoder)
    autoencoder = Dense(inputs, activation='sigmoid')(autoencoder)
    autoencoder = Model(inputs=inputLayer, outputs=autoencoder)
    autoencoder.compile(optimizer='adadelta', loss='mean_squared_error')
    return autoencoder


dataSize = 1000
variables = 2
data = np.zeros((dataSize, variables))
data[:, 0] = np.random.uniform(0, 0.8, size=dataSize)
data[:, 1] = np.random.uniform(0, 0.1, size=dataSize)

trainData, testData = data[:900], data[900:]

model = buildMyNetwork(variables, 2)
model.fit(trainData, trainData, epochs=2000)
predictions = model.predict(testData)

for x in range(variables):
    plt.scatter(testData[:, x], predictions[:, x])
    plt.show()
    plt.close()

Even though the result is sometimes acceptable, many other times it isn't. I know neural networks have random weight initialization and may therefore converge to different solutions, but I think this is too much, and there may be some mistake in my code.

Sometimes the correlation is acceptable:

Other times it is quite lost:
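
To rule out run-to-run variance coming from the random weight initialization while reproducing this, one option (a minimal sketch, assuming the TensorFlow backend) is to fix the random seeds before building the model:

import random

import numpy as np
import tensorflow as tf

# Make weight initialization and the random dataset repeatable between runs.
# tf.random.set_seed is the TensorFlow 2.x call; older 1.x versions used
# tf.set_random_seed instead.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)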

**UPDATE:**

Thanks Marcin Możejko!

Indeed, that was the problem. My original question came up because I was trying to build an autoencoder, so, to be consistent with the title, here is an example of an autoencoder (just using a more complex dataset and different activation functions):

import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model


def buildMyNetwork(inputs, bottleNeck):
    # Same architecture as before, but with tanh activations throughout
    inputLayer = Input(shape=(inputs,))
    autoencoder = Dense(inputs * 2, activation='tanh')(inputLayer)
    autoencoder = Dense(inputs * 2, activation='tanh')(autoencoder)
    autoencoder = Dense(bottleNeck, activation='tanh')(autoencoder)
    autoencoder = Dense(inputs * 2, activation='tanh')(autoencoder)
    autoencoder = Dense(inputs * 2, activation='tanh')(autoencoder)
    autoencoder = Dense(inputs, activation='tanh')(autoencoder)
    autoencoder = Model(inputs=inputLayer, outputs=autoencoder)
    autoencoder.compile(optimizer='adadelta', loss='mean_squared_error')
    return autoencoder


dataSize = 1000
variables = 6
data = np.zeros((dataSize, variables))
data[:, 0] = np.random.uniform(0, 0.5, size=dataSize)
data[:, 1] = np.random.uniform(0, 0.5, size=dataSize)
data[:, 2] = data[:, 0] + data[:, 1]   # sum
data[:, 3] = data[:, 0] * data[:, 1]   # product
data[:, 4] = data[:, 0] / data[:, 1]   # ratio (unbounded when data[:, 1] is near 0)
data[:, 5] = data[:, 0] ** data[:, 1]  # power

trainData, testData = data[:900], data[900:]

model = buildMyNetwork(variables, 2)
model.fit(trainData, trainData, epochs=2000)
predictions = model.predict(testData)

for x in range(variables):
    plt.scatter(testData[:, x], predictions[:, x])
    plt.show()
    plt.close()

For this example I used the tanh activation function, but I tried others and they worked as well. The dataset now has 6 variables, but the autoencoder has a bottleneck of 2 neurons; since variables 2 to 5 are formed by combining variables 0 and 1, the autoencoder only needs to pass the information of those two through the bottleneck and learn the functions that generate the other variables in the decoding phase. The example above shows that all the functions are learnt except one, the division... I don't know yet why.
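
One thing worth checking (a quick diagnostic sketch, not part of the original experiment) is the numeric range of each derived column: the division column can get arbitrarily large when the denominator is close to 0, while a tanh output layer can only produce values in (-1, 1):

import numpy as np

np.random.seed(0)
a = np.random.uniform(0, 0.5, size=1000)
b = np.random.uniform(0, 0.5, size=1000)
ratio = a / b  # same construction as data[:, 4] above

# tanh outputs are bounded in (-1, 1), so any target outside that interval
# cannot be reproduced exactly by the tanh output layer.
print("min/max of the division column:", ratio.min(), ratio.max())
print("fraction of samples outside (-1, 1):", np.mean(np.abs(ratio) >= 1))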


Solution

  • I think your case makes it relatively easy to explain why your network might fail to learn an identity function. Let's go through your example:

    1. Your input comes from 2D space, and it doesn't lie on a 1D or 0D submanifold, because of the uniform distribution. From this it's easy to see that, in order for your autoencoder to learn an identity function, every layer must be able to represent a function whose range is at least two-dimensional, because the output of your last layer should also lie on a 2D manifold.
    2. Let's go through your network and check whether it satisfies this condition:

      inputLayer = Input(shape=(2,))
      autoencoder = Dense(4, activation='relu')(inputLayer)
      autoencoder = Dense(4, activation='relu')(autoencoder)
      autoencoder = Dense(2, activation='relu')(autoencoder) # Possible problems here
      

      You can see that the bottleneck might cause a problem: for this layer it might be hard to satisfy the condition from the first point. To get a two-dimensional output range, you need weights that keep the examples out of the saturation region of the ReLU (otherwise the samples are squashed to 0 in one of the units, which makes it impossible for the range to be "fully" 2D). So basically, the chance that this never happens is relatively small, and the possibility that backpropagation itself moves a unit into this region also cannot be neglected. A small numerical sketch of this collapse follows below.
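
    A small numerical sketch of this collapse, using hypothetical hand-picked weights (not taken from a trained model), just to illustrate the point: if the weights of the 2-unit ReLU bottleneck make one unit's pre-activation negative for every sample, that unit outputs 0 everywhere and the layer's range degenerates to at most one dimension:

      import numpy as np

      def relu(x):
          return np.maximum(x, 0)

      np.random.seed(1)
      x = np.zeros((1000, 2))                  # same distribution as in the question
      x[:, 0] = np.random.uniform(0, 0.8, size=1000)
      x[:, 1] = np.random.uniform(0, 0.1, size=1000)

      # Hypothetical bottleneck weights: the second unit's pre-activation is
      # negative for every sample, so ReLU clips it to 0.
      W = np.array([[1.0, -1.0],
                    [1.0, -1.0]])
      b = np.array([0.0, -0.1])

      h = relu(x @ W + b)                      # bottleneck activations, shape (1000, 2)
      print(np.unique(h[:, 1]))                # [0.] -> the second unit is dead
      print(np.linalg.matrix_rank(h))          # 1    -> the representation is no longer 2D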

    UPDATE:

    In a comment, the question was asked why the optimizer fails to prevent or undo the saturation. This is an example of one of the important downsides of ReLU: once an example falls into the ReLU saturation region of a unit, that example no longer takes a direct part in the learning of that unit. It can still affect the unit by influencing the previous units, but because of the 0 derivative this influence is not direct. So basically, un-saturating an example comes as a side effect, not from the direct action of the optimizer. A minimal illustration of the zero gradient is sketched below.
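
    A minimal sketch of this zero-gradient effect, assuming a TensorFlow 2.x backend (the derivative of ReLU being 0 for negative pre-activations is the same in any version):

      import tensorflow as tf

      x = tf.Variable([-1.0, 2.0])      # one saturated entry, one active entry
      with tf.GradientTape() as tape:
          y = tf.reduce_sum(tf.nn.relu(x))
      grad = tape.gradient(y, x)
      print(grad.numpy())               # [0. 1.] -> no gradient flows through the
                                        #            saturated (negative) entry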