Tags: tensorflow, machine-learning, keras, supervised-learning

Keras model reproducibility only possible when running on one thread?


For the last couple of hours I have been working on my Keras/TensorFlow code to get reproducible results by seeding every random generator used. My solution now works, but oddly only when I run the code on a single thread using:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=config)
K.set_session(sess)

I can't explain this behaviour, so I wanted to hear your thoughts on it. For further understanding, I will post my full code below:

import random

import numpy as np
import tensorflow as tf

seed_value = 1  # any fixed seed

random.seed(seed_value)
np.random.seed(seed_value)
tf.set_random_seed(seed_value)

from keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=config)
K.set_session(sess)
""""""
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
from sklearn import preprocessing, model_selection
from sklearn.decomposition import PCA
from keras.models import load_model
from keras.utils import np_utils
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from keras.utils.np_utils import to_categorical
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix

from TimingCallback import TimeHistory


def train():
    files = ['RAW_combined_shuffled.csv']
    # select = ['html_tag_script', 'js_max_value_assignments', 'url_found_scripttags', 'url_param_count_"', 'label']
    # read_files = (pd.read_csv(f, usecols=select) for f in files)      # read only the selected features
    read_files = (pd.read_csv(f) for f in files)                        # read all features
    data = pd.concat(read_files, ignore_index=True)
    data = data.drop(['data'], axis=1)  # drop individual columns // already done in KNIME
    # data = shuffle(data)  # randomize the order of the records // already done in KNIME

    i = 100
    data_to_predict = data[:i].reset_index(drop=True)  # split off data for testing the model (rows 0 to i)
    real_label = data_to_predict.label
    real_label = np.array(real_label)
    prediction = np.array(data_to_predict.drop(['label'], axis=1))

    data = data[i:].reset_index(drop=True)  # split off data for training and testing (row i to the end)

    X = data.drop(['label'], axis=1)  # X is every column except the output label
    X = np.array(X)
    Y = data['label']  # Y is the column with the output label

    # Transform the class labels into numerical values
    encoder = LabelEncoder()
    encoder.fit(Y)
    Y = encoder.transform(Y)
    Y = np_utils.to_categorical(Y)

    # We have 2 classes: the output looks like:
    # 0,1 : Class 1
    # 1,0 : Class 2

    # Split into training and test data; random_state is the seed for the random assignment of records to train and test
    train_x, test_x, train_y, test_y = model_selection.train_test_split(X, Y, test_size=0.3, random_state=5, shuffle=False)

    input_dim = len(data.columns) - 1
    print(input_dim)

    callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    time_callback = TimeHistory()

    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(8, input_dim=input_dim, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(10, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(10, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(10, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(2, activation='softmax'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Start the training for the given number of epochs
    # validation_split takes a fraction of the training data and uses it for validation (determining val_acc and val_loss) at the end of each epoch
    # verbose controls the display; 0 = silent, 1 = progress bar, 2 = one line per epoch
    history = model.fit(train_x, train_y, validation_split=0.33, epochs=1000, batch_size=1000, verbose=1, callbacks=[callback, time_callback], shuffle=False)

    scores = model.evaluate(test_x, test_y)

As you can see, I also disabled shuffling in the model.fit method and in train_test_split. Since I would like to further improve training performance, I would normally want to use multiple threads and therefore multiple CPU cores.


Solution

  • Yes, this makes sense. The problem is composed of two things:

    • Floating-point numbers are only an approximation of the real numbers: not every value can be represented exactly, and each addition is rounded to the nearest representable number. As a consequence, addition is not associative, i.e. (a + b) + c != a + (b + c) in general, and different evaluation orders produce slightly different results. For example, in Python:

      (0.1 + 0.2) + 0.3 gives 0.6000000000000001

      0.1 + (0.2 + 0.3) gives 0.6

    • Parallel computing with multiple threads introduces more randomness into the process, because the scheduler and other running processes now influence execution. The problem appears when results from several threads are combined: typically each thread takes a lock and adds its partial result to a shared variable, but the order in which the threads do so is not defined, and by the point above that order changes the result (see the sketch after this list).

    This also happens inside GPUs, so unfortunately, if you want reproducible results, you need to minimize the use of parallelism across threads (i.e. no multi-threading).
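
    Both effects can be demonstrated in plain Python, without TensorFlow. The sketch below is purely illustrative (the parallel_sum helper and all values are made up for this answer): the same numbers summed in a different order give a slightly different total, and a thread pool makes that order depend on scheduling.

      import concurrent.futures
      import random

      # 1) Non-associativity: same numbers, different grouping, different result.
      print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
      print(0.1 + (0.2 + 0.3))  # 0.6

      # 2) Order dependence: summing the same values front-to-back and
      #    back-to-front can already disagree in the last bits.
      random.seed(0)
      values = [random.uniform(-1.0, 1.0) for _ in range(100000)]
      print(sum(values) - sum(reversed(values)))  # tiny, but typically nonzero

      # 3) Threads make the accumulation order depend on the scheduler:
      #    partial sums are added in whatever order the workers finish.
      def parallel_sum(vals, workers=4):
          chunk = len(vals) // workers
          total = 0.0
          with concurrent.futures.ThreadPoolExecutor(workers) as ex:
              futures = [ex.submit(sum, vals[i * chunk:(i + 1) * chunk])
                         for i in range(workers)]
              for f in concurrent.futures.as_completed(futures):
                  total += f.result()  # accumulation order varies from run to run
          return total

      print(parallel_sum(values))  # may differ in the last digits between runs

    With only four partial sums the runs will often still agree; the point is that nothing guarantees it, and a training step performs millions of such reductions, so the tiny differences compound into visibly different weights.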
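
    As an aside: if this code is ever migrated to TensorFlow 2.x, where tf.ConfigProto and tf.Session no longer exist, the equivalent single-thread setup would look roughly like this (called before any other TensorFlow work):

      import tensorflow as tf

      tf.random.set_seed(1)  # TF2 replacement for tf.set_random_seed; 1 is an arbitrary seed
      tf.config.threading.set_intra_op_parallelism_threads(1)
      tf.config.threading.set_inter_op_parallelism_threads(1)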