HuggingFace Dataset with 4 custom splits?


I have 4 JSON files with the same structure and the same number of rows (but different contents). I uploaded the 4 files to my HuggingFace dataset repository.

My first try

I uploaded the 4 files to the repository directory:

  • data-1.json
  • data-2.json
  • data-3.json
  • data-4.json

As a result, HuggingFace combined the 4 files into a single train split:

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 551964
    })
})

My second try

I renamed the 4 files to alpha.json, beta.json, delta.json, and gamma.json. The result was the same.

My third try

I put the 4 files into 4 folders:

  • alpha/data-1.json
  • beta/data-2.json
  • delta/data-3.json
  • gamma/data-4.json

The result was still the same.

According to the official documentation, load_dataset() only recognizes certain file and folder naming patterns.

My goal is to load my dataset like this, with 4 custom splits:

ds = load_dataset("myusername/my-dataset")
print(ds)

and get this output:

DatasetDict({
    alpha: Dataset({ # loads data-1.json
        features: ['translation'],
        num_rows: 137991
    }),
    beta: Dataset({ # loads data-2.json
        features: ['translation'],
        num_rows: 137991
    }),
    delta: Dataset({ # loads data-3.json
        features: ['translation'],
        num_rows: 137991
    }),
    gamma: Dataset({ # loads data-4.json
        features: ['translation'],
        num_rows: 137991
    })
})

The only workaround I can think of is to create 4 separate dataset repositories, which would be hard to manage.


Solution

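  • I found out that custom splits need a specific folder and file naming pattern: each file goes in data/ and its name starts with the split name, followed by a shard suffix such as -00000-of-00001: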

    my_repository/
    ├── README.md
    └── data/
        ├── alpha-00000-of-00001.csv
        ├── beta-00000-of-00001.csv
        ├── delta-00000-of-00001.csv
        └── gamma-00000-of-00001.csv

    With this layout, load_dataset() returns:

    DatasetDict({
        alpha: Dataset({
            features: ['translation'],
            num_rows: 137991
        })
        beta: Dataset({
            features: ['translation'],
            num_rows: 137991
        })
        delta: Dataset({
            features: ['translation'],
            num_rows: 137991
        })
        gamma: Dataset({
            features: ['translation'],
            num_rows: 137991
        })
    })
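
As a minimal sketch of how the files could be staged into that layout before uploading: the helper below copies each source file to data/<split>-00000-of-00001<ext> inside a local copy of the repository. The function name stage_split_files and the my_repository path are hypothetical, and the sketch assumes the naming pattern applies to .json files the same way it does to the .csv names shown above.

```python
import shutil
from pathlib import Path


def stage_split_files(split_to_file, repo_dir):
    """Copy each source file to data/<split>-00000-of-00001<ext> under repo_dir.

    split_to_file maps a custom split name to the path of its single data file.
    Hypothetical helper for illustration; it only arranges files locally and
    does not upload anything.
    """
    data_dir = Path(repo_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    staged = []
    for split, src in split_to_file.items():
        src = Path(src)
        # One file per split, so every split is shard 00000 of 00001.
        dest = data_dir / f"{split}-00000-of-00001{src.suffix}"
        shutil.copyfile(src, dest)
        staged.append(dest)
    return staged


# Example call (assumes the four files from the question exist locally):
# stage_split_files(
#     {
#         "alpha": "data-1.json",
#         "beta": "data-2.json",
#         "delta": "data-3.json",
#         "gamma": "data-4.json",
#     },
#     "my_repository",
# )
```

The original files are copied rather than moved, so the result can be checked before it replaces what is already in the repository.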