
sklearn train_test_split returns some elements in both test/train


I have a data-set X with 260 unique observations.

When running x_train,x_test,_,_=train_test_split(X,y,test_size=0.2) I would assume that [p for p in x_test if p in x_train] would be empty, but it is not. In fact, it turns out that only two observations in x_test are not in x_train.

Is that intended or...?

EDIT (posted the data I am using):

from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split as split
import numpy as np

DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])

x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)

len([p for p in x_test if p in x_train]) #is not 0

EDIT 2.0: Showing that the test works

a=np.array([[1,2,3],[4,5,6]])
b=np.array([[1,2,3],[11,12,13]])

len([p for p in a if p in b]) #1

Solution

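  • This is not a bug in sklearn's implementation of train_test_split, but a peculiarity of how the in operator works on numpy arrays. For arrays, a in b is evaluated as (a == b).any(): it does an elementwise comparison (with broadcasting) and returns True if ANY single element matches, so it does not test whether a whole row is present. In your check, p in x_train is therefore True as soon as one feature value of p equals the value in the same column of any training row.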

    import numpy as np
    
    a = np.array([[1, 2, 3], [4, 5, 6]])
    b = np.array([[6, 7, 8], [5, 5, 5]])
    a in b # True
    

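    The correct way to test for this kind of overlap is to combine the equality operator with np.all and np.any: np.all over the last axis requires every element of a row to match, and np.any then asks whether any pair of rows matched. As a bonus, you also get the indices that overlap for free.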

    import numpy as np
    
    a = np.array([[1, 2, 3], [4, 5, 6]])
    b = np.array([[6, 7, 8], [5, 5, 5], [7, 8, 9]])
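    # a in b is not a meaningful test here: shapes (2, 3) and (3, 3) do not broadcast (recent NumPy versions raise an error)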
    
    z = np.any(np.all(a == b[:, None, :], -1))  # False
    
    a = np.array([[1, 2, 3], [4, 5, 6]])
    b = np.array([[6, 7, 8], [1, 2, 3], [7, 8, 9]])
    # a in b is again not usable here, because shapes (2, 3) and (3, 3) do not broadcast
    
    overlap = np.all(a == b[:, None, :], -1)
    z = np.any(overlap)  # True
    indices = np.nonzero(overlap)  # (array([1]), array([0])): row 1 of b equals row 0 of a
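
To tie this back to the question, here is a minimal sketch comparing the naive check with the row-wise one on a split. The helper name shared_rows, the toy data, and the manual shuffle that stands in for train_test_split are all assumptions made for illustration, not part of the original post. The toy rows are unique but share the same first column, which is enough to fool the in operator.

```python
import numpy as np

def shared_rows(a, b):
    """Hypothetical helper: m[i, j] is True when row b[i] equals row a[j]."""
    return np.all(a == b[:, None, :], axis=-1)

# 20 unique rows that all have the same value in column 0.
X = np.column_stack([np.zeros(20), np.arange(20), np.arange(20) * 2])

# Stand-in for train_test_split: shuffle the row indices and hold out 20%.
rng = np.random.default_rng(42)
idx = rng.permutation(len(X))
x_test, x_train = X[idx[:4]], X[idx[4:]]

# Naive check from the question: counts a test row as soon as one element matches.
naive = len([p for p in x_test if p in x_train])

# Row-wise check: counts test rows that are identical to some training row.
exact = int(shared_rows(x_test, x_train).any(axis=0).sum())

print(naive, exact)
```

The naive count flags every test row because of the shared first column, while the row-wise count finds no genuine overlap, which is what a split without replacement of unique rows should give.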