I have a CSV file containing a large number N of columns: the first column contains the label, and the other N-1 contain a numeric representation of my data (chroma features from a music recording).
My idea is to represent the input data as an array. In practice, I want an equivalent of the standard representation of data in computer vision. Since my data is stored in a CSV file, I need a CSV parser inside the definition of the train input function. I do it this way:
def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # take a line at a time
    features = {'songID': columns[0], 'x': columns[1:]}  # create a dictionary out of the features
    labels = features.pop('songID')  # define the label
    return features, labels

def train_input_fn(data_file=fp, batch_size=128):
    """Generate an input function for the Estimator."""
    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(1_000_000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
However, this raises an error that is not very informative: AttributeError: 'list' object has no attribute 'get_shape'. I know that the culprit is the definition of x inside the features dictionary, but I don't know how to correct it because, fundamentally, I don't really grok TensorFlow's data structures yet.
As it turns out, the values in the features dictionary need to be tensors. However, each column is a tensor in itself, so taking columns[1:] results in a Python list of tensors. To create a higher-dimensional tensor that stores the information from the N-1 columns, one should use tf.stack:
features = {'songID': columns[0], 'x': tf.stack(columns[1:])} # create a dictionary out of the features
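
To check the effect of tf.stack in isolation, the parser can be run on a single hand-written line. This is only a minimal sketch under a few assumptions: the three-feature layout, the DEFAULTS values and the sample line "song42,0.1,0.2,0.3" are made up for illustration, and it uses tf.io.decode_csv, the name under which newer TensorFlow releases expose the tf.decode_csv call from the question.

```python
import tensorflow as tf

# Hypothetical layout for illustration: one string label, then three float features.
DEFAULTS = [["unknown"], [0.0], [0.0], [0.0]]

def parse_csv(line):
    # decode_csv returns a Python list holding one scalar tensor per column.
    columns = tf.io.decode_csv(line, record_defaults=DEFAULTS)
    label = columns[0]
    # Stack the N-1 scalar tensors into a single rank-1 tensor of shape (N-1,).
    x = tf.stack(columns[1:])
    return {'x': x}, label

# Made-up sample line, only to inspect the resulting static shape and dtype.
features, label = parse_csv(tf.constant("song42,0.1,0.2,0.3"))
print(features['x'].shape, features['x'].dtype)
```

With this parser, each dataset element carries a single tensor x of shape (N-1,), so batching should yield x with shape (batch_size, N-1) rather than a list of N-1 separate tensors.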