I'm try to solve a multi-linear-regression problem with a very simple linear network. The network only consists of a single dense layer as its output layer and the activation function is set to linear. I synthesize the output data Y by multiplying the input data X by the system (weight) matrix A: Y=A.X . Both X and A contain random numbers with normal or uniform distributions (the problem happens regardless). In this case, the network reaches above 99% accuracy in only 7 Epochs over 1000 samples as one would expect.
Now, if I synthesize X from Y, which this time around has uniform random numbers, using A's inverse: X = inv(A).Y , and try to train the network, after two hundred Epochs, the accuracy only reaches 94%.
Why is there such a huge disparity between the two cases even-though the system matrix (weights) is exactly the same. The only difference is related to the random distribution of X and Y. If I'm forced to follow the second case, how can I improve the trainability of my network so that it can be trained in few epochs.
I have tried different optimizers, initializers and regularizations but they didn't help.
Here's the code for the version that doesn't converge so well. To get the first version I use gen1
in Dataset.from_generator(gen2, ...)
instead of gen2
.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
N = 256
np.random.seed(0)
A = np.random.normal(0,.4,(N,N))
Ainv = np.linalg.inv(A)
import itertools
input_size = N
def gen1():
for i in itertools.count(1):
X = np.random.rand(N,1)-.5
Y = np.dot(A,X)
yield (X[:,0],Y[:,0])
def gen2():
for i in itertools.count(1):
Y = np.random.rand(N,1)-0.5
X = np.dot(Ainv,Y)
yield (X[:,0],Y[:,0])
dataset = tf.data.Dataset.from_generator(
gen2,
(tf.float64, tf.float64),
(tf.TensorShape([N]), tf.TensorShape([N])))
train_ds = dataset.take(950)
valid_ds = dataset.skip(950).take(50)
#train_ds = train_ds.shuffle(2000, reshuffle_each_iteration = True)
train_ds = train_ds.batch(1)
valid_ds = valid_ds.batch(1)
from keras.layers import Input, Dense
from keras.models import Model
from keras import backend
def rabs(y_t, y_p):
return backend.mean(backend.abs(y_p - y_t), axis=-1)/(tf.keras.backend.max(y_t) - tf.keras.backend.min(y_t))*100
inp = Input(shape=(input_size,))
out = Dense(N, activation='linear')(inp)
autoencoder = Model(inp, out)
#opt = tf.keras.optimizers.Adam(learning_rate=.0001)
opt = tf.keras.optimizers.SGD(learning_rate=.2, momentum=0.7)
autoencoder.compile(optimizer= opt,
loss=tf.keras.losses.MeanSquaredError(),metrics= [rabs])
autoencoder.summary()
autoen_model = autoencoder.fit(train_ds, validation_data = valid_ds, epochs = 200)
plt.plot(autoen_model.history['rabs'])
plt.plot(autoen_model.history['val_rabs'])
plt.title('Model Accuracy')
plt.ylabel('Relative Absolute Mean Error %')
plt.xlabel('Epoch')
plt.legend(['Training set', 'Validation set'], loc='upper left')
plt.show()
Training graphs
Case 1: Y synthesized
Case 2: X synthesized
I'm going to ignore that you're doing stochastic gradient descent, and just imagine that you're working with the entire dataset for each step. In this case, your problem in both cases is to minimize ||Y-AX||^2 over A.
After doing some algebra, you can write this as a quadratic optimization problem of the form
\min_{z} z^T Q z + b^T z,
where z \in R^{256^2} represents the entries of the matrix A, Q is a symmetric matrix obtained only from X, and b is a vector obtained from X and Y. What you are asking Tensorflow to do is to solve this problem using gradient descent.
The convergence rate of gradient descent on this type of problem is governed by the condition number of Q, which is its largest eigenvalue divided by its smallest. A condition number that is much larger than one leads to slow gradient descent, as some variables update much faster than others. A condition number closer to one is best for obtaining fast convergence. In Guler's Foundations of Optimization (Section 14.2) you can read more about the effect of condition number on convergence of (a variant of) gradient descent, though there are probably better resources on this out there.
In your case, the eigenvalues of Q are just the eigenvalues of XX^T, which are the squared singular values of X. For the first dataset, X is uniformly distributed, and in the second X= A_0^{-1} Y, where Y is uniformly distributed.
The difference in convergence you are observing comes from the fact that multiplication by A_0^{-1} wildly increases the condition number of your matrix. In the following python code I did some random trials of this, and found that the condition number of the second matrix is way bigger. Thousands of times bigger.
import numpy as np
cond1 = []
cond2 = []
for i in range(10):
A = np.random.normal(0,0.4,(256,256))
Ainv = np.linalg.inv(A)
X1 = np.random.rand(256,950)
X1sv = np.linalg.svd(X1, compute_uv = False)
Y = np.random.rand(256,950)
X2 = np.dot(Ainv,Y)
X2sv = np.linalg.svd(X2, compute_uv = False)
cond1.append((X1sv.max()/X1sv.min())**2)
cond2.append((X2sv.max()/X2sv.min())**2)
cond1 = np.array(cond1)
cond2 = np.array(cond2)
print('X1\'s condition number has mean {:.2f} and std {:.2f} '.format(cond1.mean(), cond1.std()))
print('X2\'s condition number has mean {:.2f} and std {:.2f} '.format(cond2.mean(), cond2.std()))
print('X2\'s mean condition number is {:.1f} times as big as X1\'s'.format(cond2.mean()/cond1.mean()))
So that's my guess as to why you're seeing worse convergence for the second case than the first. I could be wrong, but maybe this will point you in the right direction.
There are a couple of solutions to this: