python, tensorflow, tensorflow-datasets

Reading a CSV file using tf.data is very slow; should I use TFRecords instead?


I have a lot of CSV files, each record containing ~6000 columns. The first column is the label and the remaining columns should be treated as a feature vector. I'm new to TensorFlow and I can't figure out how to read the data into a TensorFlow Dataset in the desired format. I currently have the following code running:

import tensorflow as tf

n_features = 6170
# One default per column: 1 label + 6170 features, all parsed as float32.
DEFAULTS = [[0.0] for _ in range(n_features + 1)]

def parse_csv(line):
    # Parse one CSV line into a list of 6171 scalar tensors.
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)
    # The first column is the label; the rest form the feature vector.
    features = {'label': columns[0], 'x': tf.stack(columns[1:])}
    labels = features.pop('label')  # define the label

    return features, labels


def train_input_fn(data_file=sample_csv_file, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(10000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

Each CSV file has ~10K records. I tried a sample eval on train_input_fn as labels = train_input_fn()[1].eval(session=sess). This returns 128 labels, but it takes around 2 minutes.

Am I using some redundant operations, or is there a better way to do this?

PS: I have the original data in a Spark DataFrame, so I can use TFRecords as well if that makes things faster.
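
For reference, writing TFRecords straight from Spark is possible with the spark-tensorflow-connector package. A minimal sketch, assuming the connector JAR is on the Spark classpath; the format string "tfrecords" and all paths here are assumptions that depend on the installed connector version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The original DataFrame; schema inference gives float columns for the CSV.
df = spark.read.csv('sample.csv', inferSchema=True)

# Assumption: spark-tensorflow-connector registers the "tfrecords" source.
df.write.format("tfrecords").option("recordType", "Example").save("csv.tfrecords")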


Solution

  • You are doing it right, but a faster way is to use TFRecords, as shown in the following steps:

    1. Use tf.python_io.TFRecordWriter to read the CSV file and write it out as a TFRecord file, as shown here: Tensorflow create a tfrecords file from csv. A writer sketch follows below.
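
       A minimal writer sketch, assuming the label sits in the first column (as in the question); the helper name csv_to_tfrecord and the file paths are illustrative:

      import csv

      import tensorflow as tf


      def csv_to_tfrecord(csv_path, tfrecord_path):
          # Convert one CSV file (label first, then the features) into a
          # TFRecord file whose keys match the parser in step 2.
          with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
              with open(csv_path) as f:
                  for row in csv.reader(f):
                      values = [float(v) for v in row]
                      example = tf.train.Example(features=tf.train.Features(feature={
                          "label": tf.train.Feature(
                              float_list=tf.train.FloatList(value=values[:1])),
                          "features": tf.train.Feature(
                              float_list=tf.train.FloatList(value=values[1:])),
                      }))
                      writer.write(example.SerializeToString())


      csv_to_tfrecord('sample.csv', 'csv.tfrecords')  # illustrative paths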

    2. Read from the TFRecord file:

      def _parse_function(proto):
          # Each record stores the label and the feature vector as
          # variable-length float lists under these two keys.
          f = {
              "features": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True),
              "label": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True)
          }
          parsed_features = tf.parse_single_example(proto, f)
          features = parsed_features["features"]
          label = parsed_features["label"]
          return features, label


      dataset = tf.data.TFRecordDataset(['csv.tfrecords'])
      dataset = dataset.map(_parse_function)
      dataset = dataset.shuffle(10000).repeat().batch(128)
      iterator = dataset.make_one_shot_iterator()
      features, label = iterator.get_next()
      

    I ran both cases (CSV vs. TFRecords) on a randomly generated CSV. The total time to fetch 10 batches (128 samples each) was around 204 s for the direct CSV read versus around 0.22 s for the TFRecord read.
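
    To reproduce the comparison, a rough timing sketch (TF 1.x; features and label are the iterator outputs from step 2, and the loop mirrors the 10-batch measurement above):

      import time

      with tf.Session() as sess:
          start = time.time()
          for _ in range(10):  # 10 batches of 128 samples each
              sess.run([features, label])
          print('10 batches took %.2f s' % (time.time() - start))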