I was wondering whether the following two snippets would give the same results — more specifically, whether random_state=0 is the same as seed=0.

Using scikit-learn:
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
x = data['x']
y = data['y']
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Using GraphLab:
import graphlab
train_data,test_data = data.random_split(.8,seed=0)
As far as I know, GraphLab is not available for Python 3.4 (correct me if I am wrong), so I was not able to test this myself. Thanks.
No, the two snippets do not give the same results. The scikit-learn function shuffles the data with a random permutation and then cuts it at the desired fraction, while the SFrame.random_split method instead samples each row of the original data independently with the specified probability. On top of that, the two libraries use different random number generators, so setting random_state and seed to the same value won't make the splits match.
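The difference between the two mechanisms can be sketched in plain NumPy — a toy illustration of the two strategies, not the libraries' actual internals:

```python
import numpy as np

data = np.arange(10)  # toy dataset of 10 "rows"

# Permutation-style split (what train_test_split does conceptually):
# shuffle all indices once, then cut at the desired fraction,
# so the train set has exactly 80% of the rows.
rng = np.random.RandomState(0)
perm = rng.permutation(len(data))
cut = int(0.8 * len(data))
train_perm, test_perm = data[perm[:cut]], data[perm[cut:]]

# Sampling-style split (what random_split does conceptually):
# draw an independent uniform number per row and keep the row in
# "train" if the draw falls below the fraction -- the train size
# is only 80% in expectation, not exactly.
rng2 = np.random.RandomState(0)
mask = rng2.rand(len(data)) < 0.8
train_samp, test_samp = data[mask], data[~mask]
```

Even when both strategies are driven from the same NumPy seed, the resulting partitions differ, which is the point: the seed alone does not determine the split, the splitting algorithm does too.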
I verified this with GraphLab Create 1.7.1 and Scikit-learn 0.17.
import numpy as np
import graphlab as gl
from sklearn.cross_validation import train_test_split

sf = gl.SFrame(np.random.rand(10, 1))
sf = sf.add_row_number('row_id')

sf_train, sf_test = sf.random_split(0.6, seed=0)
df_train, df_test = train_test_split(sf.to_dataframe(),
                                     test_size=0.4,
                                     random_state=0)
sf_train is:
+--------+-------------------+
| row_id | X1 |
+--------+-------------------+
| 0 | [0.459467634448] |
| 4 | [0.424260273035] |
| 6 | [0.143786736949] |
| 7 | [0.0871068666212] |
| 8 | [0.74631952689] |
| 9 | [0.37570258651] |
+--------+-------------------+
[6 rows x 2 columns]
while df_train looks like:
row_id X1
1 1 [0.561396445174]
6 6 [0.143786736949]
7 7 [0.0871068666212]
3 3 [0.397315891635]
0 0 [0.459467634448]
5 5 [0.033673713722]
Definitely not the same.
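If you need the partitions to agree across libraries, one workaround is to draw the split mask yourself once and filter every representation of the data with it. A minimal pandas/NumPy sketch (filtering an SFrame with a boolean SArray should work analogously, though I haven't run that part against GraphLab here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(10)})

# Draw the train/test assignment once, with an explicit RNG, and
# reuse the mask everywhere; any library that can filter rows by a
# boolean array will then produce the identical partition.
rng = np.random.RandomState(0)
mask = rng.rand(len(df)) < 0.8

df_train, df_test = df[mask], df[~mask]
```

This sidesteps both problems at once: there is only one splitting algorithm and only one random number generator involved.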