Tags: python, scikit-learn, gensim, azure-machine-learning-service

Models generate different results when moving to Azure Machine Learning Studio


We developed a Jupyter Notebook on a local machine to train models with the Python 3 libraries sklearn and gensim. Because we set the random_state parameter to a fixed integer, the results were always the same.

After this, we tried moving the notebook to a workspace in Azure Machine Learning Studio (classic), but the results differ even though we keep the same random_state.

As suggested in the following links, we installed the same library versions, checked that the MKL version was the same, and confirmed that the MKL_CBWR variable was set to AUTO.

t-SNE generates different results on different machines

Same Python code, same data, different results on different machines

Still, we are not able to get the same results.

What else should we check or why is this happening?

Update

If we generate a pkl file on the local machine and import it in AML, the results are the same (as expected, since that is the purpose of the pkl file).

Still, we are looking to get the same results (if possible) without importing the pkl file.

Library versions

gensim 3.8.3.
sklearn 0.19.2.
matplotlib 2.2.3.
numpy 1.17.2.
scipy 1.1.0.
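
To compare the two machines side by side, a snapshot like the one below could be run in both notebooks and the outputs diffed. This is only a minimal sketch: the function name environment_snapshot and the "<unset>" / "<not installed>" placeholders are made up for illustration, and it reports only what the interpreter can see (it does not detect which BLAS/MKL build numpy is linked against).

```python
import importlib
import os
import platform

# Import names of the libraries listed in the question.
MODULES = ("gensim", "sklearn", "matplotlib", "numpy", "scipy")


def environment_snapshot(modules=MODULES):
    """Collect interpreter, platform, MKL_CBWR and library versions for comparison."""
    snapshot = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Placeholder string when the variable is not defined (an illustrative choice).
        "MKL_CBWR": os.environ.get("MKL_CBWR", "<unset>"),
    }
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            snapshot[name] = "<not installed>"
        else:
            snapshot[name] = getattr(module, "__version__", "<unknown>")
    return snapshot


if __name__ == "__main__":
    for key, value in environment_snapshot().items():
        print(f"{key}: {value}")
```

Any line that differs between the local output and the AML output (including the Python patch version or the CPU architecture) is a candidate explanation worth ruling out before looking at the models themselves.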

Code

Full code can be found here, sample data link inside.

import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt

from gensim.models import KeyedVectors
%matplotlib inline

import time

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns

wordvectors_file_vec = '../libraries/embeddings-new_large-general_3B_fasttext.vec'
wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec)

mat_quests = ...  # some transformations using wordvectors (elided here)

df_subset = pd.DataFrame()

pca = PCA(n_components=3, random_state = 42)
pca_result = pca.fit_transform(mat_quests)
df_subset['pca-one'] = pca_result[:,0]
df_subset['pca-two'] = pca_result[:,1] 

time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300, random_state = 42)
tsne_results = tsne.fit_transform(mat_quests)

df_subset['tsne-2d-one'] = tsne_results[:,0]
df_subset['tsne-2d-two'] = tsne_results[:,1]

pca_50 = PCA(n_components=50, random_state = 42)
pca_result_50 = pca_50.fit_transform(mat_quests)
print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_)))

time_start = time.time()
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300, random_state = 42)
tsne_pca_results = tsne.fit_transform(pca_result_50)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
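
One way to narrow down where the two machines start to disagree is to dump each intermediate result (for example pca_result.tolist() and tsne_results.tolist()) to a JSON file on both machines and compare them stage by stage. The helper below is a sketch under that assumption; the names max_abs_diff and compare_stages, the stage labels, and the tolerance of 1e-6 are all hypothetical choices, not part of the original notebook.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped 2-D tables."""
    if len(a) != len(b):
        raise ValueError("row counts differ")
    worst = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("column counts differ")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(x - y))
    return worst


def compare_stages(local, remote, tol=1e-6):
    """Map each stage present in both dicts to (max difference, within tolerance?)."""
    report = {}
    for stage, local_values in local.items():
        if stage in remote:
            diff = max_abs_diff(local_values, remote[stage])
            report[stage] = (diff, diff <= tol)
    return report


if __name__ == "__main__":
    # Dummy values for illustration only; real inputs would be loaded with json.load().
    local = {"pca": [[1.0, 2.0]], "tsne": [[0.5, 0.5]]}
    remote = {"pca": [[1.0, 2.0]], "tsne": [[0.5, 0.75]]}
    for stage, (diff, ok) in compare_stages(local, remote).items():
        print(f"{stage}: max diff {diff} ({'match' if ok else 'MISMATCH'})")
```

If the PCA stage already differs beyond the tolerance, the divergence is probably upstream of t-SNE (input data or the linear algebra backend). If PCA matches and only t-SNE differs, the likelier cause is that t-SNE's iterative optimization amplifies very small floating-point differences between machines.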

Solution

  • Definitely empathize with the issue you're having. Every data scientist has struggled with this at some point.

    The hard truth I have for you is that Azure ML Studio (classic) isn't really capable of solving this "works on my machine" problem. The good news, however, is that Azure ML Service is incredible at it. Studio (classic) doesn't let you define custom environments deterministically; it only lets you add and remove packages (and doesn't even do that very well).

    Because ML Service's execution is built on top of Docker containers and conda environments, you can feel more confident in repeated results. I highly recommend you take the time to learn it (and I'm also happy to debug any issues that come up). Azure's MachineLearningNotebooks repo has a lot of great tutorials for getting started.

    I spent two hours making a proof of concept that demonstrates how ML Service solves the problem you're having by synthesizing:

    I'm no t-SNE expert, but from the screenshot below, you can see that the t-SNE outputs are the same when I run the script locally and remotely. This might be possible with Studio (classic), but it would be hard to guarantee that it will always work.

    [Screenshot: Azure ML Experiment Results Page]
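
To make the conda-environment idea above concrete, here is a rough sketch that writes out a pinned environment specification using the versions from the question. It is not the proof of concept itself: the function name conda_environment_yaml, the environment name "tsne-repro", and the Python version 3.6 are assumptions (the question only says Python 3), so adjust them to whatever your local machine actually runs.

```python
# Versions copied from the question; "scikit-learn" is the package name for sklearn.
PINNED = {
    "gensim": "3.8.3",
    "scikit-learn": "0.19.2",
    "matplotlib": "2.2.3",
    "numpy": "1.17.2",
    "scipy": "1.1.0",
}


def conda_environment_yaml(name, pinned, python_version="3.6"):
    """Build the text of a conda environment file with exactly pinned pip packages."""
    lines = [
        f"name: {name}",
        "dependencies:",
        f"  - python={python_version}",  # assumed version, not stated in the question
        "  - pip",
        "  - pip:",
    ]
    for package, version in sorted(pinned.items()):
        lines.append(f"      - {package}=={version}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "tsne-repro" is a hypothetical environment name.
    with open("environment.yml", "w") as handle:
        handle.write(conda_environment_yaml("tsne-repro", PINNED))
```

The same environment.yml can be used to create the environment locally with conda and, as far as I know, be handed to ML Service through the azureml-core SDK's Environment.from_conda_specification, so that the local and remote runs resolve the same package versions.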