I'm following a machine learning course. I have a simple linear regression (LR) problem to help me get used to TensorFlow. The LR problem is to find parameters a
and b
such that Y = a*X + b
approximates an (x, y)
point cloud (which I generated myself for the sake of simplicity).
I am solving this LR problem using a 'fixed step size gradient descent (FSSGD)'. I implemented it using TensorFlow and it works but I noticed that it is really slow both on GPU and CPU. Because I was curious I implemented the FSSGD myself in Python/NumPy and as expected this runs much faster, about:
If TensorFlow is this slow, I cannot imagine that so many people are using this framework. So I must be doing something wrong. Can anyone help me so I can speedup my TensorFlow implementation.
I'm NOT interested in the difference between the CPU and GPU performance. Both performance indicators are merely provided for completeness and illustration. I'm interested in why my TensorFlow implementation is so much slower than a raw Python/NumPy implementation.
As reference, I add my code below.
Python v3.7.9 x64
.tensorflow-gpu==1.15
for now (because the course uses TensorFlow v1)My FSSGD implementation using TensorFlow (execution time about 40 sec @CPU to 80 sec @GPU):
#%% General imports
import numpy as np
import timeit
import tensorflow.compat.v1 as tf
#%% Get input data
# Generate simulated input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15
#%% Define tensorflow model
# Define data size
n_samples = x_data_input.shape[0]
# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))
# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")
# Define variables to be learned
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE
W = tf.get_variable("weights", (1, 1), initializer=tf.constant_initializer(0.0))
b = tf.get_variable("bias", (1,), initializer=tf.constant_initializer(0.0))
# Define loss function
Y_pred = tf.matmul(X, W) + b
loss = tf.reduce_sum((Y - Y_pred) ** 2 / n_samples) # Quadratic loss function
# %% Solve tensorflow model
#Define algorithm parameters
total_iterations = 1e5 # Defines total training iterations
#Construct TensorFlow optimizer
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE
opt = tf.train.GradientDescentOptimizer(learning_rate = 1e-4)
opt_operation = opt.minimize(loss, name="GDO")
#To measure execution time
time_start = timeit.default_timer()
with tf.Session() as sess:
#Initialize variables
sess.run(tf.global_variables_initializer())
#Train variables
for index in range(int(total_iterations)):
_, loss_val_tmp = sess.run([opt_operation, loss], feed_dict={X: x_data, Y: y_data})
#Get final values of variables
W_val, b_val, loss_val = sess.run([W, b, loss], feed_dict={X: x_data, Y: y_data})
#Print execution time
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')
# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W_val[0,0]))
print('b_val = {0:0.3f}'.format(b_val[0]))
print('')
My own python FSSGD implementation (execution time about 4 sec):
#%% General imports
import numpy as np
import timeit
#%% Get input data
# Define input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15
#%% Define Gradient Descent (GD) model
# Define data size
n_samples = x_data_input.shape[0]
#Initialize data
W = 0.0 # Initial condition
b = 0.0 # Initial condition
# Compute initial loss
y_gd_approx = W*x_data_input+b
loss = np.sum((y_data_input - y_gd_approx)**2)/n_samples # Quadratic loss function
#%% Execute Gradient Descent algorithm
#Define algorithm parameters
total_iterations = 1e5 # Defines total training iterations
GD_stepsize = 1e-4 # Gradient Descent fixed step size
#To measure execution time
time_start = timeit.default_timer()
for index in range(int(total_iterations)):
#Compute gradient (derived manually for the quadratic cost function)
loss_gradient_W = 2.0/n_samples*np.sum(-x_data_input*(y_data_input - y_gd_approx))
loss_gradient_b = 2.0/n_samples*np.sum(-1*(y_data_input - y_gd_approx))
#Update trainable variables using fixed step size gradient descent
W = W - GD_stepsize * loss_gradient_W
b = b - GD_stepsize * loss_gradient_b
#Compute loss
y_gd_approx = W*x_data_input+b
loss = np.sum((y_data_input - y_gd_approx)**2)/x_data_input.shape[0]
#Print execution time
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')
# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W))
print('b_val = {0:0.3f}'.format(b))
print('')
The actual answer to my question is hidden in the various comments. For future readers, I will summarize these findings in this answer.
This part of the answer is actually quite logically.
Each iteration (= each call of Session.run()
) TensorFlow performs computations. TensorFlow has a large overhead for starting each computation. On GPU, this overhead is even worse than on CPU. However, TensorFlow executes the actual computations very efficient and more efficiently than the above raw Python/NumPy implementation does.
So, when the number of data points is increased, and therefore the number of computations per iteration you will see that the relative performances between TensorFlow and Python/NumPy shifts in the advantage of TensorFlow. The opposite is also true.
The problem described in the question is very small meaning that the number of computation is very low while the number of iterations is very large. That is why TensorFlow performs so badly. This type of small problems is not the typical use case for which TensorFlow was designed.
Still the execution time of the TensorFlow script can be reduced a lot! To reduce the execution time the number of iterations must be reduced (no matter the size of the problem, this is a good aim anyway).
As @amin's pointed out, this is achieved by scaling the input data. A very briefly explanation why this works: the size of the gradient and variable updates are more balanced compared to the absolute values for which the values are to be found. Therefore, less steps (= iterations) are required.
Followings @amin's advise, I finally ended up by scaling my x-data as follows (some code is repeated to make the position of the new code clear):
# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))
### START NEW CODE ###
# Scale x_data
x_mean = np.mean(x_data)
x_std = np.std(x_data)
x_data = (x_data - x_mean) / x_std
### END NEW CODE ###
# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")
Scaling speed up the convergence by a factor 1000. Instead of 1e5 iterations
, 1e2 iterations
are needed. This is partially because a maximum step size of 1e-1
can be used instead of a step size of 1e-4
.
Please note that the found weight and bias are different and that you must feed scaled data from now on.
Optionally, you can choose to unscale the found weight and bias so you can feed unscaled data. Unscaling is done using this code (put somewhere at the end of the code):
#%% Unscaling
W_val_unscaled = W_val[0,0]/x_std
b_val_unscaled = b_val[0]-x_mean*W_val[0,0]/x_std