Suppose I have a tensor `tfDataSet`, built as follows:
data3d = [
[[7.042 9.118 0. 1. 1. 1. 1. 1. 0. 0. 1. ]
[5.781 5.488 7.47 0. 0. 0. 0. 1. 1. 0. 0. ]
[5.399 5.166 6.452 0. 0. 0. 0. 0. 1. 0. 0. ]
[5.373 4.852 6.069 0. 0. 0. 0. 1. 1. 0. 0. ]
[5.423 5.164 6.197 0. 0. 0. 0. 2. 1. 0. 0. ]]
,
[[ 5.247 4.943 6.434 0. 0. 0. 0. 1. 1. 0. 0. ]
[ 5.485 8.103 8.264 0. 0. 0. 0. 1. 0. 0. 1. ]
[ 6.675 9.152 9.047 0. 0. 0. 0. 1. 0. 0. 1. ]
[ 6.372 8.536 11.954 0. 0. 0. 0. 0. 0. 0. 1. ]
[ 5.669 5.433 6.703 0. 0. 0. 0. 0. 1. 0. 0. ]]
,
[[5.304 4.924 6.407 0. 0. 0. 0. 0. 1. 0. 0. ]
[5.461 5.007 6.088 0. 0. 0. 0. 1. 1. 0. 0. ]
[5.265 5.057 6.41 0. 0. 0. 0. 3. 0. 0. 1. ]
[5.379 5.026 6.206 0. 0. 0. 0. 1. 1. 0. 0. ]
[5.525 5.154 6. 0. 0. 0. 0. 1. 1. 0. 0. ]]
,
[[5.403 5.173 6.102 0. 0. 0. 0. 1. 1. 0. 0. ]
[5.588 5.279 6.195 0. 0. 0. 0. 1. 1. 0. 0. ]
[5.381 5.238 6.675 0. 0. 0. 0. 1. 0. 0. 1. ]
[5.298 5.287 6.668 0. 0. 0. 0. 1. 1. 0. 0. ]
[5.704 7.411 4.926 0. 0. 0. 0. 1. 1. 0. 0. ]]
,
... ... ... ...
... ... ... ...
]
tfDataSet = tf.convert_to_tensor(data3d)
In each 2D array inside the tensor, the first eight columns are features and the last three columns are a one-hot-encoded label.
Suppose I want to feed this tensor into a CNN. For that, I need to do two things:

1. Split `data3d` into `trainData3d`, `validData3d`, and `testData3d`.
2. Split the data into `featureData3d` and `labelData3d`.

Now, my question is: which one of the above steps should I do first and which one second, in order to be least expensive? Explain why.
If I do #2 first, how can the feature and label data maintain their correspondence?
Cross-posted: SoftwareEngineering
I'd do #1 -> #2. The advantage would be that, even if you want to shuffle your `trainData3d`, it would be ensured that the right data and labels still belong together. Otherwise you need to ensure that `data` and `label` are shuffled in the same order. I think in TF this is not trivial, as you'd need to shuffle a batch of indices and then use `tf.gather`, which is kind of slow.
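To make the index-shuffling approach concrete, here is a minimal sketch (the shapes are assumptions, following the question's layout of 8 feature columns and 3 label columns):

```python
import tensorflow as tf

# Toy stand-in for trainData3d: 6 windows of shape (5, 11),
# first 8 columns features, last 3 a one-hot label.
trainData3d = tf.random.normal((6, 5, 11))
features = trainData3d[..., :8]
labels = trainData3d[..., 8:]

# Shuffle a batch of indices once, then gather BOTH tensors with the
# same indices, so window i of `features` still matches window i of `labels`.
idx = tf.random.shuffle(tf.range(tf.shape(trainData3d)[0]))
features = tf.gather(features, idx)
labels = tf.gather(labels, idx)
```

Because both `tf.gather` calls use the same `idx`, the feature/label correspondence survives the shuffle; the cost is the extra gather passes.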
As for efficiency, I think the order doesn't matter much. Either way you'd need six operations, and all of them would be slicing operations. With shuffling, I think #1 -> #2 is still better, as you only need to shuffle once, but this also should not matter much.
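For illustration, the #1 -> #2 order boils down to plain slicing (the split sizes below are assumptions for a toy tensor of 10 windows):

```python
import tensorflow as tf

# Toy stand-in for data3d: 10 windows of shape (5, 11)
data3d = tf.random.normal((10, 5, 11))

# Step #1: split along the first axis into train/valid/test
trainData3d = data3d[:6]
validData3d = data3d[6:8]
testData3d = data3d[8:]

# Step #2: for each subset, slice off features (first 8 columns)
# and labels (last 3 columns)
trainX, trainY = trainData3d[..., :8], trainData3d[..., 8:]
validX, validY = validData3d[..., :8], validData3d[..., 8:]
testX, testY = testData3d[..., :8], testData3d[..., 8:]
```

All of these are views/slices along fixed axes, so neither ordering involves anything more expensive than slicing.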
If `data3d` is a TensorFlow `Dataset`, you could use `shuffle`, `take`, and `skip` to slice your data like here.
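A minimal sketch of that pipeline (the sizes and split ratios are assumptions). Note that `reshuffle_each_iteration=False` matters here: with the default reshuffling, `take`/`skip` could yield overlapping subsets on re-iteration.

```python
import tensorflow as tf

# Toy Dataset of 10 windows of shape (5, 11)
ds = tf.data.Dataset.from_tensor_slices(tf.random.normal((10, 5, 11)))

# Shuffle once, with a fixed seed and no reshuffling between epochs,
# so the train/valid/test split stays stable.
ds = ds.shuffle(10, seed=0, reshuffle_each_iteration=False)

train_ds = ds.take(6)
valid_ds = ds.skip(6).take(2)
test_ds = ds.skip(8)
```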
Or you could do #2 first and then use `train_test_split` from sklearn to split into train and test sets. This would transform your data back to NumPy arrays, though.
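A sketch of the sklearn route (sizes and test fraction are assumptions). `train_test_split` applies the same permutation to `X` and `y`, so the feature/label correspondence from #2 is preserved automatically:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for data3d: 10 windows of shape (5, 11)
data3d = np.random.rand(10, 5, 11)

# Step #2 first: split into features (first 8 cols) and labels (last 3)
X, y = data3d[..., :8], data3d[..., 8:]

# train_test_split shuffles X and y with one shared permutation,
# so row i of X_train still matches row i of y_train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```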