python numpy random train-test-split

Random sample without replacement while maintaining natural order of tabular data


I have time series data which is not monotonically increasing, so calling sort/shuffle is out of the question.

I want to randomly pull out n% of the data, while maintaining its relative order, to act as a validation or test set. For example:

my_ndarray = [ 1, 20, 10, 3, 90, 5, 80, 50, 4, 1] # (number of samples = 1645, number of timesteps = 10, number of features = 7)
# custom_train_test_split()
train = [1, 20, 90, 5, 50, 4, 1]
valid = [10, 3, 80]

I would appreciate some guidance on how to do this efficiently. To my understanding, Java-style iteration is inefficient in Python. I suspect a 3D boolean mask would be the Pythonic, vectorized way.


Solution

  • Here is what the solution may look like:

    • Add a temporary extra dimension to the array that stores each item's original index alongside the item.
    • Shuffle the array.
    • Take the desired portions of the array, then sort each portion by the stored index.
    • Remove the temporary dimension from the chosen portions.

    Here is the solution using plain Python lists:

    import numpy as np

    my_ndarray = [1, 20, 10, 3, 90, 5, 80, 50, 4, 1]
    # Add a temporary dimension by converting each item
    # to a sublist whose first element is the item's original index
    nda = [[i, x] for i, x in enumerate(my_ndarray)]
    # Shuffle the [index, value] pairs in place
    np.random.shuffle(nda)
    # Training data is the first 7 items, sorted back into original order
    traindata = nda[0:7]
    traindata.sort()
    traindata = [x[1] for x in traindata]
    # Test data is the rest, also sorted back into original order
    testdata = nda[7:10]
    testdata.sort()
    testdata = [x[1] for x in testdata]
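
For a NumPy array such as the (1645, 10, 7) one mentioned in the question, the same idea can be sketched without a Python-level loop. This is only a minimal sketch, not the only way to do it: the names `ordered_split`, `valid_fraction` and `seed` are made up for illustration, and it assumes the split should happen along the first (sample) axis. A 1D boolean mask over that axis is enough; a 3D mask is not needed.

```python
import numpy as np

def ordered_split(data, valid_fraction, seed=None):
    """Split `data` along axis 0 into (train, valid), keeping original order.

    Hypothetical helper for illustration; assumes samples lie along axis 0.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    n_valid = int(round(n * valid_fraction))
    # Draw the validation row indices without replacement.
    valid_idx = rng.choice(n, size=n_valid, replace=False)
    # Boolean mask over the sample axis: True marks a validation row.
    mask = np.zeros(n, dtype=bool)
    mask[valid_idx] = True
    # Boolean indexing returns rows in ascending index order,
    # so both parts keep their original relative order.
    return data[~mask], data[mask]

my_ndarray = np.array([1, 20, 10, 3, 90, 5, 80, 50, 4, 1])
train, valid = ordered_split(my_ndarray, valid_fraction=0.3, seed=0)
```

Because the mask is applied only to the first axis, calling `ordered_split(x, 0.2)` on an array of shape (1645, 10, 7) should return arrays of shape (1316, 10, 7) and (329, 10, 7), with each sample's timesteps and features left untouched.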