python, tensorflow, tensorflow-datasets

Reading a CSV file using tf.data is very slow; should I use TFRecords instead?


I have a lot of CSV files, each record containing ~6000 columns. The first column is the label and the remaining columns should be treated as a feature vector. I'm new to TensorFlow and I can't figure out how to read the data into a TensorFlow Dataset in the desired format. I currently have the following code running:

import tensorflow as tf

n_features = 6170
# One default per column: 1 label + 6170 features, all parsed as float32.
DEFAULTS = [[0.0] for _ in range(n_features + 1)]

def parse_csv(line):
    # Parse one CSV line into a list of 6171 scalar tensors.
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)
    # The first column is the label; the rest form the feature vector.
    features = {'label': columns[0], 'x': tf.stack(columns[1:])}
    labels = features.pop('label')  # define the label

    return features, labels


def train_input_fn(data_file=sample_csv_file, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(10000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

Each CSV file has ~10K records. I tried a sample eval on train_input_fn as labels = train_input_fn()[1].eval(session=sess). This returns 128 labels, but it takes around 2 minutes.

Am I using some redundant operations, or is there a better way to do this?

PS: I have the original data in a Spark DataFrame, so I can use TFRecords as well if that makes things faster.
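
For reference, writing TFRecords straight from Spark is possible with the spark-tensorflow-connector package. A minimal sketch, assuming the connector JAR is on the Spark classpath; the format string "tfrecords" and all paths here are assumptions that depend on the installed connector version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The original DataFrame; schema inference gives float columns for the CSV.
df = spark.read.csv('sample.csv', inferSchema=True)

# Assumption: spark-tensorflow-connector registers the "tfrecords" source.
df.write.format("tfrecords").option("recordType", "Example").save("csv.tfrecords")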


Solution

  • You are doing it right, but a faster way is to use TFRecords, as shown in the following steps:

    1. Use tf.python_io.TFRecordWriter to read the CSV file and write it out as a TFRecord file, as shown here: Tensorflow create a tfrecords file from csv. A writer sketch follows below.
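
       A minimal writer sketch, assuming the label sits in the first column (as in the question); the helper name csv_to_tfrecord and the file paths are illustrative:

      import csv

      import tensorflow as tf


      def csv_to_tfrecord(csv_path, tfrecord_path):
          # Convert one CSV file (label first, then the features) into a
          # TFRecord file whose keys match the parser in step 2.
          with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
              with open(csv_path) as f:
                  for row in csv.reader(f):
                      values = [float(v) for v in row]
                      example = tf.train.Example(features=tf.train.Features(feature={
                          "label": tf.train.Feature(
                              float_list=tf.train.FloatList(value=values[:1])),
                          "features": tf.train.Feature(
                              float_list=tf.train.FloatList(value=values[1:])),
                      }))
                      writer.write(example.SerializeToString())


      csv_to_tfrecord('sample.csv', 'csv.tfrecords')  # illustrative paths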

    2. Read from the TFRecord file:

      def _parse_function(proto):
          # Each record stores the label and the feature vector as
          # variable-length float lists under these two keys.
          f = {
              "features": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True),
              "label": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True)
          }
          parsed_features = tf.parse_single_example(proto, f)
          features = parsed_features["features"]
          label = parsed_features["label"]
          return features, label


      dataset = tf.data.TFRecordDataset(['csv.tfrecords'])
      dataset = dataset.map(_parse_function)
      dataset = dataset.shuffle(10000).repeat().batch(128)
      iterator = dataset.make_one_shot_iterator()
      features, label = iterator.get_next()
      

    I ran both cases (CSV vs. TFRecords) on a randomly generated CSV. The total time to fetch 10 batches (128 samples each) was around 204 s for the direct CSV read versus around 0.22 s for the TFRecord read.
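
    To reproduce the comparison, a rough timing sketch (TF 1.x; features and label are the iterator outputs from step 2, and the loop mirrors the 10-batch measurement above):

      import time

      with tf.Session() as sess:
          start = time.time()
          for _ in range(10):  # 10 batches of 128 samples each
              sess.run([features, label])
          print('10 batches took %.2f s' % (time.time() - start))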