python tensorflow tensorflow2.0 tensorflow-datasets

Read a list of CSV files and make a dataset in tensorflow


I am quite new to TensorFlow.

I have this dataset, which is available on Kaggle. I want to read only the 2018 files in the raw directory. I can list the files with TensorFlow as follows:

import tensorflow as tf

data_2018 = tf.data.Dataset.list_files("./raw/*2018*")

However, this does not load the data. I also want to choose which columns are loaded, for example columns [1, 3, 6, 8, 10]. How can I load the data from multiple CSV files and select specific columns?


Solution

  • Try using tf.data.experimental.make_csv_dataset:

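    import os

    import pandas as pd
    import tensorflow as tf
    
    # Create dummy data in two CSV files whose names match the *2018* pattern
    os.makedirs("/content/raw", exist_ok=True)  # to_csv does not create missing directories
    df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                       'mask': ['red', 'purple'],
                       'weapon': ['sai', 'bo staff']})
    df.to_csv("/content/raw/2_2018_2.csv", index=False)
    df.to_csv("/content/raw/2_2018_3.csv", index=False)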
    

    Load the CSV files and select specific columns:

    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern="/content/raw/*2018*",
        batch_size=2,
        num_epochs=1,
        select_columns=['name', 'mask'])
    for x in dataset:
        print(x['name'], x['mask'])

    Example output (make_csv_dataset shuffles rows by default, so the order and batch contents vary between runs):
    
    tf.Tensor([b'Donatello' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'red'], shape=(2,), dtype=string)
    tf.Tensor([b'Donatello' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'red'], shape=(2,), dtype=string)
    tf.Tensor([b'Raphael' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'red' b'red'], shape=(2,), dtype=string)
    tf.Tensor([b'Donatello' b'Donatello'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'purple'], shape=(2,), dtype=string)
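
The question asks for columns by position, such as [1, 3, 6, 8, 10]. According to the TensorFlow documentation, select_columns accepts integer indices as well as column names. Below is a minimal sketch of that variant, assuming 0-based indices and the same three-column dummy data as above. The temporary directory and file names are hypothetical stand-ins for the raw directory, and shuffle=False is set only to make the result reproducible.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Hypothetical sample data: two small CSV files in a temporary "raw" directory.
raw_dir = os.path.join(tempfile.mkdtemp(), "raw")
os.makedirs(raw_dir, exist_ok=True)
rows_to_write = [
    ["name", "mask", "weapon"],
    ["Raphael", "red", "sai"],
    ["Donatello", "purple", "bo staff"],
]
for i in (2, 3):
    path = os.path.join(raw_dir, f"2_2018_{i}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows_to_write)

# Select columns by 0-based position instead of by name:
# index 0 is 'name', index 2 is 'weapon'.
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=os.path.join(raw_dir, "*2018*"),
    batch_size=2,
    num_epochs=1,
    shuffle=False,  # keep file order so the result is reproducible
    select_columns=[0, 2])

# Each batch is a dict keyed by the header names of the selected columns.
selected = sorted(dataset.element_spec.keys())
rows = sum(int(batch["name"].shape[0]) for batch in dataset)
print(selected, rows)
```

For the Kaggle files, the same call with select_columns=[1, 3, 6, 8, 10] should work as long as each index is smaller than the number of columns and every matched file shares the same header.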