I'm following the Hugging Face tutorial here, and it's giving me a strange error when I run the following code:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader
raw_datasets = load_dataset("glue", "mrpc")
Here is what I see:
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 151k/151k [00:00<00:00, 3.35MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.1k/11.1k [00:00<00:00, 6.63MB/s]
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:32<00:00, 10.89s/it]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 127.92it/s]
Traceback (most recent call last):
File "/Users/ameenizhac/Downloads/transformers_playground.py", line 5, in <module>
raw_datasets = load_dataset("glue", "mrpc")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/load.py", line 1782, in load_dataset
builder_instance.download_and_prepare(
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/builder.py", line 967, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/builder.py", line 1709, in _prepare_split
split_info = self.info.splits[split_generator.name]
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/splits.py", line 530, in __getitem__
instructions = make_file_instructions(
^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/arrow_reader.py", line 112, in make_file_instructions
name2filenames = {
^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/arrow_reader.py", line 113, in <dictcomp>
info.name: filenames_for_dataset_split(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/naming.py", line 70, in filenames_for_dataset_split
prefix = filename_prefix_for_split(dataset_name, split)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/naming.py", line 54, in filename_prefix_for_split
if os.path.basename(name) != name:
^^^^^^^^^^^^^^^^^^^^^^
File "<frozen posixpath>", line 142, in basename
TypeError: expected str, bytes or os.PathLike object, not NoneType
I don't know where to start, because I don't understand where the error is coming from.
I tried both on my PC and on Google Colab. The strange thing is that it works on Colab but not on my PC.
Anyway, a possible workaround is the following:
raw_datasets = load_dataset("SetFit/mrpc")
If you print it, you will see that it contains the same data; only the repository name and the column names differ slightly:
DatasetDict({
train: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 3668
})
test: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 1725
})
validation: Dataset({
features: ['text1', 'text2', 'label', 'idx', 'label_text'],
num_rows: 408
})
})