I have a lot of CSV files, each record containing ~6000 columns. The first column is the label and the remaining columns should be treated as a feature vector. I'm new to TensorFlow and I can't figure out how to read the data into a TensorFlow Dataset with the desired format. I have the following code running currently:
DEFAULTS = []
n_features = 6170
for i in range(n_features + 1):
    DEFAULTS.append([0.0])
def parse_csv(line):
    # line = line.replace('"', '')
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # parse one line at a time
    # The first column is the label; the remaining columns form the feature vector.
    features = {'label': columns[0], 'x': tf.stack(columns[1:])}
    labels = features.pop('label')  # separate the label out of the dictionary
    return features, labels
def train_input_fn(data_file=sample_csv_file, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from the input file using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(10000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
Each CSV file has ~10K records. I tried a sample evaluation on train_input_fn with labels = train_input_fn()[1].eval(session=sess). This returns 128 labels as expected, but it takes around 2 minutes. Am I using redundant operations, or is there a better way to do this?
PS: I have the original data in a Spark DataFrame, so I can also use TFRecords if that would make things faster.
You are doing it right, but a faster way is to use TFRecords, as shown in the following steps:

Writing the tfrecord: use tf.python_io.TFRecordWriter to read the CSV file and write it out as a tfrecord file, as shown here: Tensorflow create a tfrecords file from csv. A minimal sketch of this step follows.
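The sketch below assumes pandas is available, the CSV has no header row, and the layout described in the question (first column label, remaining columns features); csv_file and tfrecords_file are placeholder names:

import pandas as pd
import tensorflow as tf

def csv_to_tfrecords(csv_file, tfrecords_file):
    df = pd.read_csv(csv_file, header=None)
    with tf.python_io.TFRecordWriter(tfrecords_file) as writer:
        for row in df.itertuples(index=False):
            row = list(row)
            # The keys must match the ones used when parsing ('label', 'features').
            example = tf.train.Example(features=tf.train.Features(feature={
                'label': tf.train.Feature(
                    float_list=tf.train.FloatList(value=[row[0]])),
                'features': tf.train.Feature(
                    float_list=tf.train.FloatList(value=row[1:]))
            }))
            writer.write(example.SerializeToString())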
Reading from the tfrecord:
def _parse_function(proto):
    # Both fields were written as float lists of unknown length,
    # so parse them as variable-length sequences.
    f = {
        "features": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True),
        "label": tf.FixedLenSequenceFeature([], tf.float32, default_value=0.0, allow_missing=True)
    }
    parsed_features = tf.parse_single_example(proto, f)
    features = parsed_features["features"]
    label = parsed_features["label"]
    return features, label

dataset = tf.data.TFRecordDataset(['csv.tfrecords'])
dataset = dataset.map(_parse_function)
dataset = dataset.shuffle(10000).repeat().batch(128)

iterator = dataset.make_one_shot_iterator()
features, label = iterator.get_next()
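To pull values out of the pipeline, you can evaluate these tensors in a session, e.g. (a sketch; the shapes in the comment assume the ~6170-feature data from the question):

with tf.Session() as sess:
    batch_features, batch_labels = sess.run([features, label])
    # batch_features has shape (128, 6170); batch_labels has shape (128, 1).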
I ran both cases (csv vs. tfrecords) on a randomly generated CSV. The total time for 10 batches (128 samples each) was around 204s for the direct CSV read, versus around 0.22s for the tfrecord read.
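The timing itself is straightforward; a sketch of how such numbers can be measured (next_batch stands for the get_next() tensors of whichever pipeline is being timed):

import time

with tf.Session() as sess:
    start = time.time()
    for _ in range(10):
        sess.run(next_batch)  # materialize one batch of 128 samples
    print('10 batches took %.2fs' % (time.time() - start))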