
Best practices to use .tfrecord files for forecasting


What are the best practices for storing/reading data to/from TFRecord files to train a forecasting model? I want to build a model that can predict the health of individual machines (for example, an electric motor) based on their historical health data (for example, the historical data from a fleet of motors, including each motor's speed, error rate, breakdowns, etc.).

I can do the entire preprocessing (normalize the data, impute missing values, engineer new features, split into train/validate/test sets, etc.) with Apache Beam/Dataflow. But I was thinking it might be better to store the raw data as .tfrecord files and use TFX to do the normalization, imputation, etc., to make experimentation easier. TFX's tensorflow_transform currently doesn't support tf.SequenceExample, so I was thinking of storing the raw data as tf.Example records, each in the following format:

example_proto = tf.train.Example(features=tf.train.Features(feature={
    'timestamp': tf.train.Feature(int64_list=tf.train.Int64List(
        value=[1601200000, 1601200060, 1601200120, ...])),
    'feature0': tf.train.Feature(float_list=tf.train.FloatList(
        value=[np.nan, 15523.0, np.nan, ...])),
    'feature1': tf.train.Feature(float_list=tf.train.FloatList(
        value=[1.0, -8.0, np.nan, ...])),
    ...
    'label': tf.train.Feature(float_list=tf.train.FloatList(
        value=[0.5, -10.3, 2.1, ...])),
}))
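
(For context, below is a minimal sketch of how I could serialize and write such records with Apache Beam; the raw records, feature names, and output path are placeholders, not my real data.)

import apache_beam as beam
import tensorflow as tf

# Placeholder input: one dict of aligned per-timestep lists per machine.
raw_machine_records = [
    {'timestamp': [1601200000, 1601200060, 1601200120],
     'feature0': [float('nan'), 15523.0, float('nan')],
     'label': [0.5, -10.3, 2.1]},
]

def to_example(record):
    # Each key becomes one feature holding the whole time series as a list.
    feature = {
        'timestamp': tf.train.Feature(
            int64_list=tf.train.Int64List(value=record['timestamp'])),
        'feature0': tf.train.Feature(
            float_list=tf.train.FloatList(value=record['feature0'])),
        'label': tf.train.Feature(
            float_list=tf.train.FloatList(value=record['label'])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | 'CreateRecords' >> beam.Create(raw_machine_records)
        | 'ToExample' >> beam.Map(to_example)
        | 'Serialize' >> beam.Map(lambda ex: ex.SerializeToString())
        | 'WriteTFRecord' >> beam.io.WriteToTFRecord(
            'raw_data/machines', file_name_suffix='.tfrecord.gz'))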

What do you think? Any tips?


Solution

  • TFX 0.23.0 added support for tf.SequenceExample in some components.

    You can also make use of tf.Example with lists, in the way you describe. If you need to feed a sequence to your model based on your tf.Example records, you will need to use tf.Transform to stack and reshape the values that are read in, for example:

    # Stack the per-feature lists and reshape to [batch, timesteps, features].
    float32 = tf.reshape(
            tf.stack(...),
            [-1, timesteps, features])
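
    A fuller sketch of what that could look like inside a tensorflow_transform preprocessing_fn is shown below. The fixed timestep count, the feature names, and the assumption that each example stores exactly that many values per feature (parsed as VarLenFeature, so they arrive as SparseTensors) are assumptions for illustration, not part of the question.

    import tensorflow as tf

    TIMESTEPS = 60                            # assumed fixed sequence length
    FEATURE_KEYS = ['feature0', 'feature1']   # feature names from the question

    def preprocessing_fn(inputs):
        """Stacks per-feature lists into a [batch, timesteps, features] tensor.

        Assumes each feature arrives as a SparseTensor (VarLenFeature) and that
        every example stores exactly TIMESTEPS values per feature.
        """
        outputs = {}

        # Densify each per-feature list; shape becomes [batch, TIMESTEPS].
        dense = [tf.sparse.to_dense(inputs[key], default_value=0.0)
                 for key in FEATURE_KEYS]

        # Stack along the last axis and reshape to [-1, timesteps, features].
        stacked = tf.stack(dense, axis=-1)
        outputs['sequence'] = tf.reshape(
            stacked, [-1, TIMESTEPS, len(FEATURE_KEYS)])

        outputs['label'] = tf.sparse.to_dense(inputs['label'], default_value=0.0)
        return outputs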