
Best practices to use .tfrecord files for forecasting


What are the best practices for storing/reading data to/from TFRecord files to train a forecasting model? I want to build a model that can predict the health of individual machines (for example, an electric motor) based on their historical health data (for example, the historical data from a fleet of motors, including each motor's speed, error rate, breakdowns, etc.).

I can do the entire preprocessing (normalize the data, impute missing values, engineer new features, split into train/validate/test sets, etc.) with Apache Beam/Dataflow. But I was thinking it might be better to store the raw data as .tfrecord files and use TFX to do the normalization, imputation, etc., to make experimentation easier. TFX's tensorflow_transform currently doesn't support tf.SequenceExample, so I was thinking of storing the raw data as tf.Example records, each in the following format:

example_proto = tf.train.Example(features=tf.train.Features(feature={
    'timestamp': tf.train.Feature(int64_list=tf.train.Int64List(
        value=[1601200000, 1601200060, 1601200120, ...])),
    'feature0': tf.train.Feature(float_list=tf.train.FloatList(
        value=[np.nan, 15523.0, np.nan, ...])),
    'feature1': tf.train.Feature(float_list=tf.train.FloatList(
        value=[1.0, -8.0, np.nan, ...])),
    ...
    'label': tf.train.Feature(float_list=tf.train.FloatList(
        value=[0.5, -10.3, 2.1, ...])),
}))
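
(For context, below is a minimal sketch of how I could serialize and write such records with Apache Beam; the raw records, feature names, and output path are placeholders, not my real data.)

import apache_beam as beam
import tensorflow as tf

# Placeholder input: one dict of aligned per-timestep lists per machine.
raw_machine_records = [
    {'timestamp': [1601200000, 1601200060, 1601200120],
     'feature0': [float('nan'), 15523.0, float('nan')],
     'label': [0.5, -10.3, 2.1]},
]

def to_example(record):
    # Each key becomes one feature holding the whole time series as a list.
    feature = {
        'timestamp': tf.train.Feature(
            int64_list=tf.train.Int64List(value=record['timestamp'])),
        'feature0': tf.train.Feature(
            float_list=tf.train.FloatList(value=record['feature0'])),
        'label': tf.train.Feature(
            float_list=tf.train.FloatList(value=record['label'])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | 'CreateRecords' >> beam.Create(raw_machine_records)
        | 'ToExample' >> beam.Map(to_example)
        | 'Serialize' >> beam.Map(lambda ex: ex.SerializeToString())
        | 'WriteTFRecord' >> beam.io.WriteToTFRecord(
            'raw_data/machines', file_name_suffix='.tfrecord.gz'))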

What do you think? Any tips?


Solution

  • TFX 0.23.0 added support for tf.SequenceExample in some components.

    You can also make use of tf.Example with lists, in the way you describe. If you need to feed a sequence to your model based on your tf.Example records, you will need to use tf.Transform to stack and reshape the values that are read in, for example:

    # Stack the per-feature lists and reshape to [batch, timesteps, features].
    float32 = tf.reshape(
            tf.stack(...),
            [-1, timesteps, features])
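
    A fuller sketch of what that could look like inside a tensorflow_transform preprocessing_fn is shown below. The fixed timestep count, the feature names, and the assumption that each example stores exactly that many values per feature (parsed as VarLenFeature, so they arrive as SparseTensors) are assumptions for illustration, not part of the question.

    import tensorflow as tf

    TIMESTEPS = 60                            # assumed fixed sequence length
    FEATURE_KEYS = ['feature0', 'feature1']   # feature names from the question

    def preprocessing_fn(inputs):
        """Stacks per-feature lists into a [batch, timesteps, features] tensor.

        Assumes each feature arrives as a SparseTensor (VarLenFeature) and that
        every example stores exactly TIMESTEPS values per feature.
        """
        outputs = {}

        # Densify each per-feature list; shape becomes [batch, TIMESTEPS].
        dense = [tf.sparse.to_dense(inputs[key], default_value=0.0)
                 for key in FEATURE_KEYS]

        # Stack along the last axis and reshape to [-1, timesteps, features].
        stacked = tf.stack(dense, axis=-1)
        outputs['sequence'] = tf.reshape(
            stacked, [-1, TIMESTEPS, len(FEATURE_KEYS)])

        outputs['label'] = tf.sparse.to_dense(inputs['label'], default_value=0.0)
        return outputs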