python numpy set-difference

Is NumPy's setdiff1d broken?


To select data for training and validation in my machine learning projects, I usually use NumPy's masking functionality. A typical recurring block of code to select the indices for the validation and training data looks like this:

import numpy as np

validation_split = 0.2

all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
idxTrain = np.setdiff1d(all_idx, idxValid)

Now the following should always be true:

len(all_idx) == len(idxValid)+len(idxTrain)

Unfortunately, I found that this is not always the case. As I increase the number of elements chosen from the all_idx array, the resulting numbers no longer add up. Here is another standalone example, which breaks as soon as I increase the number of randomly chosen validation indices above 1000:

import numpy as np

all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, 1000)
idxTrain = np.setdiff1d(all_idx, idxValid)

print(len(all_idx), len(idxValid), len(idxTrain))

This prints: 100000 1000 99005

I am confused. Please try it yourself; I would be glad to understand what is going on.


Solution

  • idxValid = np.random.choice(all_idx, 10, replace=False)
    

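    Careful: you need to indicate that you don't want duplicates in idxValid. By default, np.random.choice samples with replacement, so the same index can be drawn more than once. np.setdiff1d returns the unique values of all_idx that are not in idxValid, so each duplicated index is removed only once and idxTrain ends up longer than len(all_idx) - len(idxValid). To avoid this, just add replace=False to np.random.choice: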

    replace : boolean, optional
        Whether the sample is with or without replacement.
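
To make the explanation above concrete, here is a minimal sketch that reuses the arrays from the question. It first checks that the "missing" elements are exactly the duplicates drawn by the default sampling, then repeats the split with replace=False. The names idx_with_dups, n_unique and train_from_dups are introduced here only for illustration; they are not part of the original code.

```python
import numpy as np

all_idx = np.arange(0, 100000)

# Default behaviour: sampling WITH replacement, so indices may repeat.
idx_with_dups = np.random.choice(all_idx, 1000)
n_unique = len(np.unique(idx_with_dups))  # at most 1000, varies per run

# setdiff1d drops each distinct value once, so the result size depends
# on the number of unique validation indices, not on len(idx_with_dups).
train_from_dups = np.setdiff1d(all_idx, idx_with_dups)
assert len(train_from_dups) == len(all_idx) - n_unique

# Sampling WITHOUT replacement: every validation index is distinct.
idxValid = np.random.choice(all_idx, 1000, replace=False)
idxTrain = np.setdiff1d(all_idx, idxValid)

# The invariant from the question now holds.
assert len(all_idx) == len(idxValid) + len(idxTrain)
print(len(all_idx), len(idxValid), len(idxTrain))  # prints: 100000 1000 99000
```

So setdiff1d is not broken: in the first half, the gap between 1000 and the number of removed elements is simply the number of repeated draws.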