Tags: python, tensorflow, tensorflow2.0, tensorflow-datasets

Splitting a custom binary dataset into train/test subsets using TensorFlow I/O


I am trying to use local binary data to train a network to perform regression inference.

Each local binary file has the following layout:

[record layout image: 2 meta-data values + a 20×20 grid + 1 label = 403 float32 values per record]

and the whole dataset consists of several *.bin files with the layout above. Each file holds a variable number of records of 403*4 bytes. I was able to read one of those files using the following code:

import tensorflow as tf

# 2 meta-data values + 20*20 grid + 1 label = 403 float32 values per record
RAW_N = 2 + 20*20 + 1

def convert_binary_to_float_array(register):
    return tf.io.decode_raw(register, out_type=tf.float32)

# each record is RAW_N*4 bytes long
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(map_func=convert_binary_to_float_array)

Now I need to create four datasets, train_data, train_labels, test_data, and test_labels, as follows:

train_data, train_labels, test_data, test_labels = prepare_ds(raw_dataset, 0.8)

and use them to train & evaluate:

model = build_model()

history = model.fit(train_data, train_labels, ...)

loss, mse = model.evaluate(test_data, test_labels)

My question is: how do I implement the function prepare_ds(dataset, frac)?

def prepare_ds(dataset, frac):
    ...

I have tried tf.shape, tf.reshape, tf.slice, and subscripting with [:], all with no success. I realized that those approaches cannot work because, after the map() call, raw_dataset is a MapDataset, i.e. a tf.data.Dataset rather than a tensor, and datasets do not support indexing or slicing.
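
For example, trying to slice the dataset directly fails with an error like this:

raw_dataset[:100]
# TypeError: 'MapDataset' object is not subscriptable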


Solution

  • If the meta-data is supposed to be part of your inputs, which I am assuming, you could try something like this:

    import random
    import struct
    import tensorflow as tf
    
    RAW_N = 2 + 20*20 + 1
    
    # write a dummy bin file: 1612 ints = 6448 bytes = 4 records of RAW_N*4 bytes;
    # decode_raw reinterprets those int bytes as float32, which is fine for a demo
    values = random.sample(range(1, 5000), RAW_N*4)
    with open('mydata.bin', 'wb') as f:
      f.write(struct.pack(f'{RAW_N*4}i', *values))
    
    def decode_and_prepare(register):
      register = tf.io.decode_raw(register, out_type=tf.float32)
      inputs = register[:402]   # 2 meta-data values + 20*20 grid
      label = register[402:]    # the last value is the label
      return inputs, label
    
    total_data_entries = 8  # 4 records per file, 2 files
    raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin', 'mydata.bin'], record_bytes=RAW_N*4)
    raw_dataset = raw_dataset.map(decode_and_prepare)
    # disable per-epoch reshuffling, otherwise take/skip would mix train
    # and test samples every time the datasets are iterated again
    raw_dataset = raw_dataset.shuffle(buffer_size=total_data_entries, reshuffle_each_iteration=False)
    
    train_ds_size = int(0.8 * total_data_entries)
    test_ds_size = total_data_entries - train_ds_size  # avoids dropping records to rounding
    
    train_ds = raw_dataset.take(train_ds_size)
    remaining_data = raw_dataset.skip(train_ds_size)
    test_ds = remaining_data.take(test_ds_size)
    

    Note that I am using the same bin file twice for demonstration purposes. After running that code snippet, you could feed the datasets to your model like this:

    model = build_model()
    
    history = model.fit(train_ds, ...)
    
    loss, mse = model.evaluate(test_ds)
    

    as each dataset contains the inputs and the corresponding labels.
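
    If you want the exact prepare_ds(dataset, frac) signature from the question, the same take/skip logic can be wrapped into a function. A minimal sketch, assuming the data is small enough for one full counting pass (a file-based dataset generally does not know its length up front, so the record count here is obtained by iterating once):

    def prepare_ds(dataset, frac):
      # the dataset does not expose its length, so count by iterating once
      total = sum(1 for _ in dataset)
      train_size = int(frac * total)
      # shuffle once, with reshuffling disabled so take/skip stay disjoint
      shuffled = dataset.shuffle(buffer_size=total, reshuffle_each_iteration=False)
      return shuffled.take(train_size), shuffled.skip(train_size)
    
    train_ds, test_ds = prepare_ds(raw_dataset, 0.8)

    Also note that Keras expects batched data when a tf.data.Dataset is passed directly, so you would typically call train_ds = train_ds.batch(batch_size) (and likewise for test_ds) before handing them to fit and evaluate.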