Tags: python, arrays, tensorflow, conv-neural-network

Splitting a data set for CNN


Suppose I have a tensor tfDataSet built as follows:

data3d = [
    [[7.042, 9.118, 0.,     1., 1., 1., 1., 1., 0., 0., 1.],
     [5.781, 5.488, 7.47,   0., 0., 0., 0., 1., 1., 0., 0.],
     [5.399, 5.166, 6.452,  0., 0., 0., 0., 0., 1., 0., 0.],
     [5.373, 4.852, 6.069,  0., 0., 0., 0., 1., 1., 0., 0.],
     [5.423, 5.164, 6.197,  0., 0., 0., 0., 2., 1., 0., 0.]],

    [[5.247, 4.943, 6.434,  0., 0., 0., 0., 1., 1., 0., 0.],
     [5.485, 8.103, 8.264,  0., 0., 0., 0., 1., 0., 0., 1.],
     [6.675, 9.152, 9.047,  0., 0., 0., 0., 1., 0., 0., 1.],
     [6.372, 8.536, 11.954, 0., 0., 0., 0., 0., 0., 0., 1.],
     [5.669, 5.433, 6.703,  0., 0., 0., 0., 0., 1., 0., 0.]],

    [[5.304, 4.924, 6.407,  0., 0., 0., 0., 0., 1., 0., 0.],
     [5.461, 5.007, 6.088,  0., 0., 0., 0., 1., 1., 0., 0.],
     [5.265, 5.057, 6.41,   0., 0., 0., 0., 3., 0., 0., 1.],
     [5.379, 5.026, 6.206,  0., 0., 0., 0., 1., 1., 0., 0.],
     [5.525, 5.154, 6.,     0., 0., 0., 0., 1., 1., 0., 0.]],

    [[5.403, 5.173, 6.102,  0., 0., 0., 0., 1., 1., 0., 0.],
     [5.588, 5.279, 6.195,  0., 0., 0., 0., 1., 1., 0., 0.],
     [5.381, 5.238, 6.675,  0., 0., 0., 0., 1., 0., 0., 1.],
     [5.298, 5.287, 6.668,  0., 0., 0., 0., 1., 1., 0., 0.],
     [5.704, 7.411, 4.926,  0., 0., 0., 0., 1., 1., 0., 0.]],

    ... ... ... ...
    ... ... ... ...
]

import tensorflow as tf

tfDataSet = tf.convert_to_tensor(data3d)

In each 2D array inside the tensor, the first eight columns are features and the remaining three columns are one-hot-encoded labels.

Suppose I want to feed this tensor into a CNN. For that, I need to do two things:

  • (1) split the data3d into trainData3d, validData3d, and testData3d
  • (2) split each of the above three into featureData3d and labelData3d.

Now, my question is: which of the above steps should I do first and which second in order to be least expensive?

Explain why.

If I do #2 first, how can the feature and label data maintain their correspondence?

Cross-posted: SoftwareEngineering


Solution

  • I'd do #1 -> #2. The advantage is that, even if you want to shuffle your trainData3d, the right data and labels are guaranteed to still belong together. Otherwise you need to make sure that data and labels are shuffled in the same order. I think this is not trivial in TF, as you'd need to shuffle a tensor of indices and then use tf.gather, which is kind of slow.
    As for efficiency, I think the order doesn't matter much. Either way you need six operations, and all of them are slicing operations. With shuffling, I think #1 -> #2 is still better, as you only need to shuffle once, but this also shouldn't matter much.
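
    A minimal sketch of the #1 -> #2 route, assuming data3d is the plain tensor from the question (the 80/10/10 proportions and the trainX/trainY-style names are just examples):

    import tensorflow as tf

    n = tf.shape(tfDataSet)[0]
    n_train = tf.cast(tf.cast(n, tf.float32) * 0.8, tf.int32)
    n_valid = tf.cast(tf.cast(n, tf.float32) * 0.1, tf.int32)

    # Optional: shuffle the whole tensor once before splitting.
    shuffled = tf.gather(tfDataSet, tf.random.shuffle(tf.range(n)))

    # Step #1: split along the sample axis into train/valid/test.
    trainData3d = shuffled[:n_train]
    validData3d = shuffled[n_train:n_train + n_valid]
    testData3d  = shuffled[n_train + n_valid:]

    # Step #2: first 8 columns are features, last 3 are the one-hot labels.
    trainX, trainY = trainData3d[..., :8], trainData3d[..., 8:]
    validX, validY = validData3d[..., :8], validData3d[..., 8:]
    testX,  testY  = testData3d[..., :8],  testData3d[..., 8:]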

    If data3d is a TensorFlow Dataset, you could use shuffle, take and skip to slice your data.
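
    A rough sketch of that route, assuming the tensor is wrapped with from_tensor_slices (the 800/100 split sizes, the buffer size and the seed are made-up example values):

    import tensorflow as tf

    ds = tf.data.Dataset.from_tensor_slices(tfDataSet)
    # reshuffle_each_iteration=False keeps the shuffled order fixed,
    # so take/skip always yield the same disjoint splits.
    ds = ds.shuffle(buffer_size=1000, seed=42, reshuffle_each_iteration=False)

    n_train, n_valid = 800, 100
    train_ds = ds.take(n_train)                     # step #1
    valid_ds = ds.skip(n_train).take(n_valid)
    test_ds  = ds.skip(n_train + n_valid)

    # Step #2: map each 5x11 sample to a (features, labels) pair.
    split_xy = lambda x: (x[..., :8], x[..., 8:])
    train_ds = train_ds.map(split_xy)
    valid_ds = valid_ds.map(split_xy)
    test_ds  = test_ds.map(split_xy)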

    Or you could do #2 first and then use train_test_split from sklearn to split into train and test sets. This would convert your data back to NumPy arrays, though.
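
    For illustration, a sketch of that alternative (the 80/10/10 proportions are arbitrary). Because X and y are passed to train_test_split together, the same row permutation is applied to both, so the feature/label correspondence is preserved even though #2 is done first:

    import numpy as np
    from sklearn.model_selection import train_test_split

    data = np.asarray(data3d)
    X, y = data[..., :8], data[..., 8:]             # step #2 first

    # Step #1: the same shuffled split is applied to X and y together.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train, y_train, test_size=0.125, random_state=42)  # 0.125 * 0.8 = 0.1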