Tags: python, csv, tensorflow, data-import

Import csv row as array in tensorflow


I have a csv file containing a large number N of columns: the first column contains the label, the other N-1 a numeric representation of my data (Chroma features from a music recording).

My idea is to represent the input data as an array; in practice, I want an equivalent of the standard representation of data in computer vision. Since my data is stored in a csv, I need a csv parser inside the definition of the training input function. I do it this way:

def parse_csv(line):
    columns = tf.decode_csv(line, record_defaults=DEFAULTS)  # take a line at a time
    features = {'songID': columns[0], 'x': columns[1:]}  # create a dictionary out of the features
    labels = features.pop('songID')  # define the label
    return features, labels


def train_input_fn(data_file=fp, batch_size=128):
    """Generate an input function for the Estimator."""

    # Extract lines from input files using the Dataset API.
    dataset = tf.data.TextLineDataset(data_file)
    dataset = dataset.map(parse_csv)
    dataset = dataset.shuffle(1_000_000).repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

However, this raises an error that is not very informative: AttributeError: 'list' object has no attribute 'get_shape'. I know that the culprit is the definition of x inside the features dictionary, but I don't know how to correct it because, fundamentally, I don't really grok the data structures of tensorflow yet.


Solution

  • As it turns out, the values in the features dictionary need to be tensors. Each column returned by tf.decode_csv is a tensor in its own right, so the slice columns[1:] is a Python list of tensors rather than a tensor, which is what triggers the AttributeError. To build a single higher-dimensional tensor that holds the information from the N-1 feature columns, use tf.stack:

    features = {'songID': columns[0], 'x': tf.stack(columns[1:])}  # create a dictionary out of the features
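
To see the effect of tf.stack on the parsed row, here is a minimal sketch. It assumes TensorFlow 2.x with eager execution, where the parser is available as tf.io.decode_csv. The three-feature layout, the DEFAULTS value, and the two in-memory rows are hypothetical stand-ins for the real file and are not taken from the question.

```python
import tensorflow as tf

# Hypothetical layout: one integer label column followed by three float features.
NUM_FEATURES = 3
DEFAULTS = [[0]] + [[0.0]] * NUM_FEATURES


def parse_csv(line):
    # decode_csv returns a Python list holding one scalar tensor per column.
    columns = tf.io.decode_csv(line, record_defaults=DEFAULTS)
    label = columns[0]
    # tf.stack merges the N-1 scalar tensors into one tensor of shape (N-1,).
    features = {'x': tf.stack(columns[1:])}
    return features, label


# Two in-memory CSV rows stand in for the lines of the real file.
lines = ["7,0.1,0.2,0.3", "9,0.4,0.5,0.6"]
dataset = tf.data.Dataset.from_tensor_slices(lines).map(parse_csv).batch(2)

# Pull one batch and inspect it (relies on eager execution in TF 2.x).
features, labels = next(iter(dataset))
x_shape = tuple(features['x'].shape)
label_values = labels.numpy().tolist()
print(x_shape, label_values)
```

After batching, features['x'] is a single tensor with one row per example and one column per feature, which is the array-like layout the question asks for.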