
Seed options: Using different packages for machine learning in Python


I was wondering whether the following two code snippets give the same results. More specifically, is random_state=0 the same as seed=0?

-Using sklearn:

from sklearn.cross_validation import train_test_split
x = data['x']
y = data['y']
X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size = 0.2,random_state = 0)

-Using graphlab:

import graphlab
train_data,test_data = data.random_split(.8,seed=0)

As far as I know, graphlab is not available for Python 3.4 (correct me if I am wrong), so I was not able to test this myself. Thanks.


Solution

  • No, the two snippets do not give the same results. scikit-learn's train_test_split shuffles the data with a random permutation and then cuts it at the requested fraction. SFrame.random_split works differently: it randomly samples rows from the original data, and the specified fraction is only approximate.

    On top of that, the two libraries use different random number generators, so setting random_state and seed to the same value does not make the splits match.

    I verified this with GraphLab Create 1.7.1 and Scikit-learn 0.17.

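    import numpy as np
    import graphlab as gl
    # In scikit-learn 0.18+, train_test_split lives in sklearn.model_selection.
    from sklearn.cross_validation import train_test_split
    
    # Ten random rows, each tagged with its original position.
    sf = gl.SFrame(np.random.rand(10, 1))
    sf = sf.add_row_number('row_id')
    
    # Same 60/40 split and the same seed value in both libraries.
    sf_train, sf_test = sf.random_split(0.6, seed=0)
    df_train, df_test = train_test_split(sf.to_dataframe(),
                                         test_size=0.4,
                                         random_state=0)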
    

    sf_train is:

    +--------+-------------------+
    | row_id |         X1        |
    +--------+-------------------+
    |   0    |  [0.459467634448] |
    |   4    |  [0.424260273035] |
    |   6    |  [0.143786736949] |
    |   7    | [0.0871068666212] |
    |   8    |  [0.74631952689]  |
    |   9    |  [0.37570258651]  |
    +--------+-------------------+
    [6 rows x 2 columns]
    

    while df_train looks like:

       row_id                 X1
    1       1   [0.561396445174]
    6       6   [0.143786736949]
    7       7  [0.0871068666212]
    3       3   [0.397315891635]
    0       0   [0.459467634448]
    5       5   [0.033673713722]
    

    Definitely not the same: only rows 0, 6, and 7 appear in both training sets.
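
For readers who cannot install GraphLab, the difference between the two splitting strategies can be sketched with NumPy alone. This is only an illustration under assumptions: permutation_split and per_row_split are hypothetical helper names, and per_row_split is a stand-in for "sample rows by fraction", not GraphLab's actual implementation. The last part uses Python's random module and NumPy as stand-ins for "two different generators given the same seed".

```python
import random

import numpy as np


def permutation_split(n_rows, test_fraction, seed):
    """Shuffle all row indices, then cut at a fixed position.

    The split sizes are always exact for a given n_rows and fraction.
    """
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(test_fraction * n_rows))
    train = sorted(shuffled[n_test:].tolist())
    test = sorted(shuffled[:n_test].tolist())
    return train, test


def per_row_split(n_rows, train_fraction, seed):
    """Assign each row to the training set independently.

    Each row lands in train with probability train_fraction, so the
    split sizes only approximate the requested fraction.
    """
    rng = np.random.RandomState(seed)
    in_train = rng.rand(n_rows) < train_fraction
    indices = np.arange(n_rows)
    return indices[in_train].tolist(), indices[~in_train].tolist()


if __name__ == "__main__":
    # Same seed, same nominal 60/40 split, two different strategies.
    print("permutation:", permutation_split(10, 0.4, seed=0))
    print("per-row:    ", per_row_split(10, 0.6, seed=0))

    # Same seed value, two different generators: the first draws differ.
    print("random.Random(0):     ", random.Random(0).random())
    print("np.random.RandomState:", np.random.RandomState(0).rand())
```

Running the script prints the row indices each strategy picks. Even with a shared generator and seed, the two strategies consume the random stream differently, so there is no reason to expect matching splits across libraries.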