I have time series data whose values are not monotonically increasing, so simply sorting or shuffling the whole set is out of the question.
I want to randomly pull out n% of the data, while maintaining its relative order, to act as a validation or test set. For example:
my_ndarray = [1, 20, 10, 3, 90, 5, 80, 50, 4, 1]  # flat list for illustration; the real array has shape (number of samples = 1645, number of timesteps = 10, number of features = 7)
# custom_train_test_split()
train = [1, 20, 90, 5, 50, 4, 1]
valid = [10, 3, 80]
I would appreciate some guidance on how to do this efficiently. To my understanding, Java-style index iteration is inefficient in Python, so I suspect a boolean mask over the sample axis would be the Pythonic, vectorized way.
Here is a solution using plain Python lists:
import numpy as np

my_ndarray = [1, 20, 10, 3, 90, 5, 80, 50, 4, 1]
# Pair each value with its original index, so the relative
# order can be restored after shuffling
nda = [[i, v] for i, v in enumerate(my_ndarray)]
np.random.shuffle(nda)
# Training data is the first 7 items; sorting on the
# index restores the original order
traindata = sorted(nda[0:7])
traindata = [x[1] for x in traindata]
# Test data is the rest, restored to original order the same way
testdata = sorted(nda[7:10])
testdata = [x[1] for x in testdata]
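For the real 3D array, the same idea can be done in a vectorized way with a boolean mask over the sample axis, as the question suggests. A minimal sketch, assuming the data has shape (n_samples, n_timesteps, n_features) and the split is along the first axis; `custom_train_test_split`, `valid_frac`, and the fixed seed are illustrative choices, not part of the original:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed only for reproducibility

def custom_train_test_split(data, valid_frac=0.3):
    n = data.shape[0]
    n_valid = int(n * valid_frac)
    # Draw n_valid random sample indices without replacement,
    # then sort them so relative order is preserved
    valid_idx = np.sort(rng.choice(n, size=n_valid, replace=False))
    # Boolean mask over the sample axis: True = validation sample.
    # Boolean indexing keeps the original row order on both sides.
    mask = np.zeros(n, dtype=bool)
    mask[valid_idx] = True
    return data[~mask], data[mask]

data = np.arange(24).reshape(4, 3, 2)  # toy stand-in for (1645, 10, 7)
train, valid = custom_train_test_split(data, valid_frac=0.25)
print(train.shape, valid.shape)  # (3, 3, 2) (1, 3, 2)
```

Because boolean indexing selects rows in their stored order, both splits keep the original time ordering without any per-element loop.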