python tensorflow tensorflow2.0 tensorflow-datasets

Read a list of CSV files and make a dataset in tensorflow

I am quite new to TensorFlow.

I have this dataset, which is available on Kaggle. I want to read only the files from 2018, which are in the raw directory. I can list the files using TensorFlow in the following manner:

import tensorflow as tf

data_2018 = tf.data.Dataset.list_files("./raw/*2018*")

However, this does not load the data; it only lists the matching file names. I also want to choose which columns are loaded, for example the columns at indices [1, 3, 6, 8, 10]. How can I load the data from multiple CSV files and select only those columns?

Solution

Try using tf.data.experimental.make_csv_dataset:

import os

import pandas as pd
import tensorflow as tf

# Create dummy data
# to_csv does not create missing directories, so make sure the target exists
os.makedirs("/content/raw", exist_ok=True)
df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})
df.to_csv("/content/raw/2_2018_2.csv", index=False)
df.to_csv("/content/raw/2_2018_3.csv", index=False)

Load the CSV files that match the pattern and select specific columns:

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="/content/raw/*2018*",
    batch_size=2,
    num_epochs=1,
    select_columns=['name', 'mask'])

# Each element is a dict that maps a column name to a batch of values
for x in dataset:
    print(x['name'], x['mask'])

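Example output (make_csv_dataset shuffles by default, so the row order differs between runs, and the number of batches depends on how many files match the pattern):

tf.Tensor([b'Donatello' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'red'], shape=(2,), dtype=string)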
tf.Tensor([b'Donatello' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'red'], shape=(2,), dtype=string)
tf.Tensor([b'Raphael' b'Raphael'], shape=(2,), dtype=string) tf.Tensor([b'red' b'red'], shape=(2,), dtype=string)
tf.Tensor([b'Donatello' b'Donatello'], shape=(2,), dtype=string) tf.Tensor([b'purple' b'purple'], shape=(2,), dtype=string)
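
The question asks for columns by position rather than by name. select_columns is documented to accept integer indices as well as column names, so something like the following should work. This is a minimal sketch, not a tested recipe for the Kaggle data: raw_dir is a hypothetical temporary directory standing in for the raw folder, the indices [0, 2] stand in for [1, 3, 6, 8, 10], and shuffle=False is set only to make the order repeatable.

```python
import os
import tempfile

import pandas as pd
import tensorflow as tf

# Hypothetical scratch directory, used only for this sketch
raw_dir = os.path.join(tempfile.mkdtemp(), "raw")
os.makedirs(raw_dir, exist_ok=True)

# Same dummy data as above, written to two files that match *2018*
df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})
df.to_csv(os.path.join(raw_dir, "2_2018_2.csv"), index=False)
df.to_csv(os.path.join(raw_dir, "2_2018_3.csv"), index=False)

# Select columns by zero-based position: 0 is 'name', 2 is 'weapon'
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=os.path.join(raw_dir, "*2018*"),
    batch_size=2,
    num_epochs=1,
    select_columns=[0, 2],
    shuffle=False)  # assumption: fixed order, for a repeatable demo

# The batches are still keyed by the column names read from the header
for batch in dataset:
    print(batch['name'], batch['weapon'])
```

The indices are zero-based and must not contain duplicates. The resulting dictionaries are keyed by the header names, so the rest of the pipeline does not need to know the positions.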