python scikit-learn

Python, Sklearn: How to reverse train_test_split of Sklearn?


Suppose I have a dataset X and its labels Y, and I split it into a training set and a test set with a test size of 0.2, shuffling with random seed 11:

>>> X.shape
(10000, 50, 50)

train_data, test_data, train_label, test_label = train_test_split(X, Y, test_size=0.2, random_state=11, shuffle=True)

How do I find the original index of a sample in the split data, i.e. reverse the random shuffle?

For example, what is the corresponding X[?] for train_data[123]?


Solution

  • Depending on the type of data, this may or may not be easy. If the rows in the training data are unique and non-repeating, you can stringify each element in X and then use the list's index method to find its position.

    For example:

    X =  ['i like wanda', 'i dont like anything', 'does this matter', 'this is choice test', 'how are you useful',  'are you mattering', 'this is a random test', 'this is my test', 'i dont like math', 'how can anything matter', 'who does matter', 'i like water', 'this is someone test', 'how does it matter', 'what is horrible',  'i dont like you', 'this is a valid test', 'this is a sample test', 'i like everything', 'i like ice cream', 'how can anything be useful', 'how is this useful', 'this is horrible', 'i dont like jokes']
    
    
    Y = ['0', '0', '1', '0', '1', '1', '0', '0', '0', '1', '1', '0', '0', '1', '1', '0', '0', '0', '0', '0', '1', '1', '0', '0']
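    from sklearn.model_selection import train_test_split

    train_data, test_data, train_label, test_label = train_test_split(X, Y, test_size=0.2, random_state=11, shuffle=True)
    for each in train_data:
        # list.index returns the position of the first matching element in X
        print(X.index(each))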
    

    The above gives the original index in X of each training sample. This works here only because X has distinct elements and they are strings; X.index returns the first match, so duplicate rows would all map to the same index. For more complex data types, more processing may be needed.
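
For array-valued samples such as the (10000, 50, 50) data in the question, a lookup with X.index is not practical. A different technique from the one above, and not part of the original answer, is to pass an extra array of positions through train_test_split so the original indices are shuffled together with the data. The sketch below assumes NumPy arrays; the small random X and Y are hypothetical stand-ins for the real dataset, and the names indices, train_idx and test_idx are chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples of shape (5, 5) with binary labels.
rng = np.random.default_rng(0)
X = rng.random((100, 5, 5))
Y = rng.integers(0, 2, size=100)

# Extra array holding each sample's original position in X.
indices = np.arange(len(X))

# train_test_split applies the same shuffle to every array it is given,
# so train_idx[i] is the original index of train_data[i].
(train_data, test_data,
 train_label, test_label,
 train_idx, test_idx) = train_test_split(
    X, Y, indices, test_size=0.2, random_state=11, shuffle=True
)

# Original position in X of one training sample.
i = 12
original = train_idx[i]
print(original)
```

With this, the X[?] that corresponds to train_data[123] in the question would be X[train_idx[123]], provided the split is made with the index array included from the start.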