Tags: machine-learning, azure-machine-learning-service, random-seed, iris-dataset

What is Random seed in Azure Machine Learning?


I am learning Azure Machine Learning, and I frequently encounter a Random Seed parameter in steps such as:

  1. Split Data
  2. Untrained algorithm models such as Two-Class Regression, Multiclass Regression, Tree, Forest, ...

In the tutorial they set the Random Seed to '123' and the trained model has high accuracy, but when I try other random integers such as 245, 256, 12, 321, ... it does not perform as well.


Questions

  • What is a Random Seed Integer?
  • How do I carefully choose a Random Seed from the range of integer values? What is the key or strategy for choosing it?
  • Why does Random Seed significantly affect the ML Scoring, Prediction and Quality of the trained model?

Pretext

  1. I have the Iris sepal/petal dataset with Sepal (Length & Width) and Petal (Length & Width)
  2. The last column in the dataset is 'Binomial ClassName'
  3. I am training the dataset with the Multiclass Decision Forest algorithm and splitting the data with different random seeds 321, 123 and 12345, in that order
  4. This affects the final quality of the trained model, with random seed 123 being the best (prediction probability score of 1).

ML Studio Snap


Observations (screenshots of the evaluation results)

  1. Random seed: 321
  2. Random seed: 123
  3. Random seed: 12345


Solution

  • What is a Random Seed Integer?

    I will not go into any details regarding what a random seed is in general; there is plenty of material available via a simple web search (see for example this SO thread).

    Random seed serves just to initialize the (pseudo)random number generator, mainly in order to make ML examples reproducible.
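
    To make this concrete, here is a minimal sketch (using NumPy, which scikit-learn itself relies on; the specific seed values are just for illustration) showing that the seed only fixes the starting point of the generator, so the same seed reproduces the same "random" sequence:

    import numpy as np

    rng_a = np.random.RandomState(123)
    rng_b = np.random.RandomState(123)
    print(rng_a.rand(3))   # three "random" numbers
    print(rng_b.rand(3))   # identical numbers, because the seed is the same

    rng_c = np.random.RandomState(321)
    print(rng_c.rand(3))   # a different, but equally valid, sequence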

  • How do I carefully choose a Random Seed from the range of integer values? What is the key or strategy for choosing it?

    Arguably this is already answered implicitly above: you are simply not supposed to choose any particular random seed, and your results should be roughly the same across different random seeds.
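
    If you want a performance estimate that does not hinge on one lucky or unlucky seed, a common practice is to repeat the split over several seeds (or use cross-validation) and report the mean and spread of the score. Below is a rough sketch with scikit-learn; the choice of 30 seeds and of fixing the tree's own random_state are my own illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    import numpy as np

    X, y = load_iris(return_X_y=True)

    scores = []
    for seed in range(30):                           # 30 different random seeds
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        dt = DecisionTreeClassifier(random_state=0)  # fix the tree's own randomness
        dt.fit(X_tr, y_tr)
        scores.append(dt.score(X_te, y_te))          # test-set accuracy

    print("mean accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))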

  • Why does Random Seed significantly affect the ML Scoring, Prediction and Quality of the trained model?

    Now, to the heart of your question: the answer here (i.e. with the iris dataset) is small-sample effects...

    To start with, your reported results across different random seeds are not that different. Nevertheless, I agree that, at first sight, a difference in macro-average precision between 0.90 and 0.94 might seem large; but a closer look reveals that the difference is really not an issue. Why?

    Using 20% of your (only) 150-sample dataset leaves you with just 30 samples in your test set (where the evaluation is performed); this is stratified, i.e. about 10 samples from each class. Now, for datasets of that small size, it is not difficult to imagine that a difference in the correct classification of only 1-2 samples can produce this apparent difference in the reported performance metrics.
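
    A quick back-of-the-envelope calculation (my own addition, just to quantify the resolution of such a test set) makes this clear: each test sample is worth about 3.3 percentage points of overall accuracy, and each sample within a class is worth 10 points of that class's recall:

    n_test = 30
    print(1 / n_test)        # ~0.033: one sample moves overall accuracy by ~3.3 points

    n_per_class = 10
    print(1 / n_per_class)   # 0.1: one sample moves a class's recall by 10 points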

    Let's try to verify this in scikit-learn using a decision tree classifier (the essence of the issue does not depend on the specific framework or the ML algorithm used):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix, classification_report
    from sklearn.model_selection import train_test_split

    # Load the iris dataset (150 samples, 3 classes)
    X, y = load_iris(return_X_y=True)

    # Stratified 80/20 split, analogous to the Split Data module with seed 321
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321, stratify=y)

    # Train a decision tree and evaluate it on the 30-sample test set
    dt = DecisionTreeClassifier()
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    

    Result:

    [[10  0  0]
     [ 0  9  1]
     [ 0  0 10]]
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00        10
               1       1.00      0.90      0.95        10
               2       0.91      1.00      0.95        10
    
       micro avg       0.97      0.97      0.97        30
       macro avg       0.97      0.97      0.97        30
    weighted avg       0.97      0.97      0.97        30
    

    Let's repeat the code above, changing only the random_state argument in train_test_split; for random_state=123 we get:

    [[10  0  0]
     [ 0  7  3]
     [ 0  2  8]]
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00        10
               1       0.78      0.70      0.74        10
               2       0.73      0.80      0.76        10
    
       micro avg       0.83      0.83      0.83        30
       macro avg       0.84      0.83      0.83        30
    weighted avg       0.84      0.83      0.83        30
    

    while for random_state=12345 we get:

    [[10  0  0]
     [ 0  8  2]
     [ 0  0 10]]
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00        10
               1       1.00      0.80      0.89        10
               2       0.83      1.00      0.91        10
    
       micro avg       0.93      0.93      0.93        30
       macro avg       0.94      0.93      0.93        30
    weighted avg       0.94      0.93      0.93        30
    

    Looking at the absolute numbers in the 3 confusion matrices (in small samples, percentages can be misleading), you should be able to convince yourself that the differences are not that big, and that they can arguably be justified by the random element inherent in the whole procedure (here, the exact split of the dataset into training and test sets).
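
    To make this concrete, the macro-average precision can be recomputed directly from the three confusion matrices shown above (a small sketch of my own; the matrices are copied from the outputs), which shows that the gap between 0.84 and 0.97 corresponds to just a handful of extra misclassified samples:

    import numpy as np

    def macro_precision(cm):
        cm = np.asarray(cm, dtype=float)
        # per-class precision = diagonal / column sum (i.e. predicted positives)
        return np.mean(np.diag(cm) / cm.sum(axis=0))

    cm_321   = [[10, 0, 0], [0, 9, 1], [0, 0, 10]]   # 1 misclassified sample
    cm_123   = [[10, 0, 0], [0, 7, 3], [0, 2,  8]]   # 5 misclassified samples
    cm_12345 = [[10, 0, 0], [0, 8, 2], [0, 0, 10]]   # 2 misclassified samples

    for name, cm in [("321", cm_321), ("123", cm_123), ("12345", cm_12345)]:
        print(name, round(macro_precision(cm), 2))   # 0.97, 0.84, 0.94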

    Should your test set be significantly bigger, these discrepancies would be practically negligible...
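
    To see this effect, one can compare the seed-to-seed spread of the accuracy for a small versus a large dataset; the sketch below is my own illustration, using a synthetic problem from make_classification so that the test set can be scaled up at will (the expectation, not a guarantee for every run, is a clearly smaller spread for the larger test set):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    def accuracy_spread(n_samples, n_seeds=30):
        # Synthetic 3-class problem; 20% of n_samples ends up in the test set
        X, y = make_classification(n_samples=n_samples, n_classes=3,
                                   n_informative=4, random_state=0)
        accs = []
        for seed in range(n_seeds):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.2, random_state=seed, stratify=y)
            dt = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
            accs.append(dt.score(X_te, y_te))
        return np.std(accs)

    print(accuracy_spread(150))    # 30-sample test set
    print(accuracy_spread(3000))   # 600-sample test set: spread expected to shrink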

    A last note: I have used the exact same seed numbers as you, but this does not actually mean anything, since in general the random number generators across platforms and languages are not the same; hence the corresponding seeds are not actually compatible. See my own answer in Are random seeds compatible between systems? for a demonstration.
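
    As a quick illustration of that last point within Python itself (my own sketch; the linked answer demonstrates the issue across languages and platforms), even two different generators in the same language produce different sequences from the same seed:

    import random
    import numpy as np

    random.seed(123)
    np.random.seed(123)

    # Same seed, different generators, hence different "random" numbers:
    print(random.random())    # Python's built-in generator
    print(np.random.rand())   # NumPy's legacy global generator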