Tags: python, tensorflow, tensorflow2.0, tensorflow-datasets

Splitting a custom binary dataset into train/test subsets using TensorFlow I/O


I am trying to use local binary data to train a network to perform regression inference.

Each local binary file has the following layout:

[record layout image: 2 meta-data values + a 20×20 grid + 1 label = 403 float32 values per record]

and the whole dataset consists of several *.bin files with the layout above. Each file holds a variable number of records of 403*4 bytes. I was able to read one of those files using the following code:

import tensorflow as tf

# 2 meta-data values + 20*20 grid + 1 label = 403 float32 values per record
RAW_N = 2 + 20*20 + 1

def convert_binary_to_float_array(register):
    return tf.io.decode_raw(register, out_type=tf.float32)

# each record is RAW_N*4 bytes long
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(map_func=convert_binary_to_float_array)

Now I need to create four datasets, train_data, train_labels, test_data, and test_labels, as follows:

train_data, train_labels, test_data, test_labels = prepare_ds(raw_dataset, 0.8)

and use them to train & evaluate:

model = build_model()

history = model.fit(train_data, train_labels, ...)

loss, mse = model.evaluate(test_data, test_labels)

My question is: how do I implement the function prepare_ds(dataset, frac)?

def prepare_ds(dataset, frac):
    ...

I have tried tf.shape, tf.reshape, tf.slice, and subscripting with [:], all with no success. I realized that those approaches cannot work because, after the map() call, raw_dataset is a MapDataset, i.e. a tf.data.Dataset rather than a tensor, and datasets do not support indexing or slicing.
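
For example, trying to slice the dataset directly fails with an error like this:

raw_dataset[:100]
# TypeError: 'MapDataset' object is not subscriptable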


Solution

  • If the meta-data is supposed to be part of your inputs, which I am assuming, you could try something like this:

    import random
    import struct
    import tensorflow as tf
    
    RAW_N = 2 + 20*20 + 1
    
    # write a dummy bin file: 1612 ints = 6448 bytes = 4 records of RAW_N*4 bytes;
    # decode_raw reinterprets those int bytes as float32, which is fine for a demo
    values = random.sample(range(1, 5000), RAW_N*4)
    with open('mydata.bin', 'wb') as f:
      f.write(struct.pack(f'{RAW_N*4}i', *values))
    
    def decode_and_prepare(register):
      register = tf.io.decode_raw(register, out_type=tf.float32)
      inputs = register[:402]   # 2 meta-data values + 20*20 grid
      label = register[402:]    # the last value is the label
      return inputs, label
    
    total_data_entries = 8  # 4 records per file, 2 files
    raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin', 'mydata.bin'], record_bytes=RAW_N*4)
    raw_dataset = raw_dataset.map(decode_and_prepare)
    # disable per-epoch reshuffling, otherwise take/skip would mix train
    # and test samples every time the datasets are iterated again
    raw_dataset = raw_dataset.shuffle(buffer_size=total_data_entries, reshuffle_each_iteration=False)
    
    train_ds_size = int(0.8 * total_data_entries)
    test_ds_size = total_data_entries - train_ds_size  # avoids dropping records to rounding
    
    train_ds = raw_dataset.take(train_ds_size)
    remaining_data = raw_dataset.skip(train_ds_size)
    test_ds = remaining_data.take(test_ds_size)
    

    Note that I am using the same bin file twice for demonstration purposes. After running that code snippet, you could feed the datasets to your model like this:

    model = build_model()
    
    history = model.fit(train_ds, ...)
    
    loss, mse = model.evaluate(test_ds)
    

    as each dataset contains the inputs and the corresponding labels.
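
    If you want the exact prepare_ds(dataset, frac) signature from the question, the same take/skip logic can be wrapped into a function. A minimal sketch, assuming the data is small enough for one full counting pass (a file-based dataset generally does not know its length up front, so the record count here is obtained by iterating once):

    def prepare_ds(dataset, frac):
      # the dataset does not expose its length, so count by iterating once
      total = sum(1 for _ in dataset)
      train_size = int(frac * total)
      # shuffle once, with reshuffling disabled so take/skip stay disjoint
      shuffled = dataset.shuffle(buffer_size=total, reshuffle_each_iteration=False)
      return shuffled.take(train_size), shuffled.skip(train_size)
    
    train_ds, test_ds = prepare_ds(raw_dataset, 0.8)

    Also note that Keras expects batched data when a tf.data.Dataset is passed directly, so you would typically call train_ds = train_ds.batch(batch_size) (and likewise for test_ds) before handing them to fit and evaluate.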