
How to find the size or shape of a tensorflow.python.data.ops.dataset_ops.MapDataset object (the output of make_csv_dataset)


Using make_csv_dataset we can read a CSV file into a TensorFlow dataset object:

csv_data = tf.data.experimental.make_csv_dataset(
    "./train.csv",
    batch_size=8190,
    num_epochs=1,
    ignore_errors=True,)

Now csv_data is of type tensorflow.python.data.ops.dataset_ops.MapDataset. How can I find the size or shape of csv_data?

print(csv_data) gives column information as below:

<MapDataset element_spec={'title': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'user_id': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>

Of course, getting the count from train_recom.csv using pandas.read_csv is an option; I was just curious whether TensorFlow has anything easier.


Solution

  • If you want to get the size of your batched dataset without any preprocessing steps, try:

    import pandas as pd
    import tensorflow as tf
    
    df = pd.DataFrame(data={'A': [50.1, 1.23, 4.5, 4.3, 3.2], 'B':[50.1, 1.23, 4.5, 4.3, 3.2], 'C':[5.2, 3.1, 2.2, 1., 3.]})
    
    df.to_csv('data1.csv', index=False)
    df.to_csv('data2.csv', index=False)
    
    dataset = tf.data.experimental.make_csv_dataset(
        "/content/*.csv",
        batch_size=2,
        field_delim=",",
        num_epochs=1,
        select_columns=['A', 'B', 'C'],
        label_name='C')
    
    dataset_len = len(list(dataset))  # one pass over the batched dataset
    print(dataset_len)
    # 5
    

    If you want to know how many samples you have altogether, try unbatch:

    dataset_len = len(list(dataset.unbatch()))  # count individual samples
    print(dataset_len)
    # 10
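
  • As a side note, tf.data also exposes Dataset.cardinality() (and tf.data.experimental.cardinality in older releases), but for datasets built by make_csv_dataset it typically reports UNKNOWN_CARDINALITY, so iterating is still required. Counting with Dataset.reduce avoids materializing every batch in a Python list. A minimal sketch, using an in-memory range dataset as a stand-in for the CSV one:

    ```python
    import tensorflow as tf

    # Stand-in for the CSV dataset: 10 samples, batched by 2.
    dataset = tf.data.Dataset.range(10).batch(2)

    # cardinality() happens to be known here; for make_csv_dataset it is
    # usually tf.data.UNKNOWN_CARDINALITY, so check before trusting it.
    print(dataset.cardinality().numpy())  # 5

    # Counting via reduce streams through the data instead of
    # building a list of all elements in memory.
    num_batches = dataset.reduce(tf.constant(0, tf.int64), lambda n, _: n + 1)
    num_samples = dataset.unbatch().reduce(tf.constant(0, tf.int64), lambda n, _: n + 1)
    print(num_batches.numpy(), num_samples.numpy())  # 5 10
    ```

    For large CSV files this is the safer pattern, since len(list(...)) pulls every batch into host memory before counting.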