Tags: python, download, dataset, huggingface-datasets

How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?


I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows:

pip install huggingface_hub[hf_transfer]
huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --local-dir-use-symlinks False 

However, the downloaded files don't have their original filenames. Instead, their hashes (a git SHA-1 or a SHA-256, depending on whether the file is stored with LFS) are used as filenames:

--- /home/dernonco/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/blobs ---------------------------------------------
                         /..                                                                                                       
   12.9 GiB [##########]  b581945ddee5e673fa2059afb25274b1523f270687b5253cb8aa72865760ebc0
    3.9 GiB [###       ]  86ebd2861a42b27168d75f346dd72f0e2b9eaee0afb90890beff15d025af45c6
    3.9 GiB [##        ]  f9b81739ee30450b930390e1155e2cdea1b3063379ba6fd9253513eba1ab1e05
    3.7 GiB [##        ]  e54c7d123ad93f4144eebdca2827ef81ea1ac282ddd2243386528cd157c02f36
    3.7 GiB [##        ]  736e225a7dd38a7987d0745b1b2f545ab701cfdf1f639874f5743b5bfb5cb1e1
    3.7 GiB [##        ]  0687246c92ec87b54e1c5fe623a77b650c02e6884e17a6f0fb4052a862d928d0
    3.6 GiB [##        ]  2becb5f9878b95f1b12622f50868f5855221985f05910d7cc759e6be074e6b8e
    3.5 GiB [##        ]  2208068c69b39c46ee9fac862da3c060c58b61adcaee1b3e6aa5d6d5dd3eba86
    3.5 GiB [##        ]  caf87e71232cbb8a31960a26ba30b9412c15893c831ef118196c581cfd3a3779
    3.4 GiB [##        ]  dc88cbf0ef45351bdc1f53c4396466d3e79874803719e266630ed6c3ad911d6a
    3.4 GiB [##        ]  f05f7fb3b55b6840ebc4ada5daa28742bbae6ad4dcc35781dc811024f27a1b4e
    3.4 GiB [##        ]  88bd831618b36330ef5cd84b7ccbc4d5f3f55955c0b223208bc2244b27fb2d78
    3.4 GiB [##        ]  bf80943b3389ddbeb8fb8a56af2d7fa5d09c5af076aac93f54ad921ee382c77d
    3.3 GiB [##        ]  83b2627e644c9ad0486e3bd966b02f014722e668d26b9d52394c974fcf2fdcf8
    3.2 GiB [##        ]  e52e7b086dabd431b25cf309e1fe513190543e058f4e7a2d8e05b22821ded4fe
    3.2 GiB [##        ]  4fe583348f3ac118f34c7b93b6a187ba4e21a5a7f5b6ca1a6adbce1cc6d563a9
    3.2 GiB [##        ]  ae6b6faca3bbd75e7ca99ccf20b55b017393bf09022efb8459293afffe06dc6e
    3.1 GiB [##        ]  5865379a894f8dc40703bdc1093d45fda67d5e1a742a2eebddd37e1a00f067fd
    3.1 GiB [##        ]  cd346324b29390a589926ccab7187ae818cf5f9fcbaf8ecc95313e6cdfab86bc
    3.0 GiB [##        ]  914eb2b1174a662e3faebac82f6b5591a54def39a9d3a7e5ab2347ecc87a982f
    2.9 GiB [##        ]  24789f33332e8539b2ee72a0a489c0f4d0c6103f7f9600de660d78543ade9111
    2.9 GiB [##        ]  35e8da5f831b36416c9569014c58f881a0a30c00db9f3caae0d7db6a8fd3c694
    2.8 GiB [##        ]  d5127e0298661d40a343d58759ed6298f9d2ef02d5c4f6a30bd9e07bc5423317
    2.8 GiB [##        ]  1b4e1951da2462ca77d94d220a58c97f64caa2b2defe4df95feed9defcee6ca7
    2.8 GiB [##        ]  75a4725625c095d98ecef7d68d384d7b1201ace046ef02ed499776b0ac02b61e
    2.8 GiB [##        ]  fefbbc3e87be522b7e571c78a188aba35bd5d282cf8f41257097a621af64ff60
 Total disk usage: 184.8 GiB  Apparent size: 184.8 GiB  Items: 85                                          
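
For what it's worth, the cache also contains a snapshots/<commit>/ directory whose entries do carry the original filenames, as symlinks pointing back into blobs/. A minimal sketch to list them, assuming the default ~/.cache/huggingface/hub layout shown above:

from pathlib import Path

repo_dir = Path.home() / ".cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio"
for snapshot in (repo_dir / "snapshots").iterdir():  # one subdirectory per commit
    for link in sorted(snapshot.rglob("*")):
        if link.is_symlink():
            # link.name is the original filename; its target is the hash-named blob
            print(link.relative_to(snapshot), "->", link.resolve().name)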

How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?


Solution

  • I ran into the same problem and wrote a Python script to handle it.

    For example, I downloaded the naver-clova-ix/synthdog-en dataset with:

    $ huggingface-cli download --repo-type dataset --resume-download naver-clova-ix/synthdog-en --local-dir synthdog-en
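
    If you prefer the Python API, the same download should also work with snapshot_download (a minimal sketch, assuming huggingface_hub is installed; synthdog-en is just the example target directory):

    from huggingface_hub import snapshot_download

    # Download the dataset repo into ./synthdog-en; the local layout mirrors the repo.
    snapshot_download(
        repo_id="naver-clova-ix/synthdog-en",
        repo_type="dataset",
        local_dir="synthdog-en",
    )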
    

    The synthdog-en directory structure is as follows:

    synthdog-en
    ├── README.md
    ├── data
    │   ├── train-00000-of-00084-26dbc51f3d0903b9.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/9d0260e08cb5a4f9c14fa794465bcb66fae6ef7ccc2f6d7ef20efa44810c0648
    │   ├── train-00001-of-00084-3efa94914043c815.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/04441e203ff713743c0c9a1009f71f97e47bc4d7b2c9313f4fcfa9c3e73b20e3
    │   ├── ...
    │   └── validation-00000-of-00001-394e0bd4c5ebec42.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/4e5f27b7a976041855d80eb07680de4ea014be07a494f40b246058dfce46d44b
    └── dataset_infos.json
    

    The full Python script is as follows:

    import shutil
    from pathlib import Path

    from tqdm import tqdm


    def cp_symlink_file_to_dst(file_path: Path, dst_dir: Path):
        # Only the symlinks created by huggingface-cli need handling; skip regular files.
        if not file_path.is_symlink():
            return

        # The symlink target is relative, e.g. ../../../.cache/huggingface/hub/...,
        # so rebuild an absolute path by re-rooting everything after the last "../"
        # at the home directory (Path.readlink requires Python 3.9+).
        real_file_path = file_path.readlink()
        real_file_path = Path.home() / str(real_file_path).rpartition("../")[-1]

        # The symlink itself carries the original filename.
        real_file_name = file_path.name

        dst_file_path = Path(dst_dir) / real_file_name

        # Copy the cached blob to the destination under its original name.
        shutil.copy(real_file_path, dst_file_path)


    if __name__ == "__main__":
        # Run from inside synthdog-en: the parquet symlinks live in its data/ subfolder.
        data_dir = Path("data")
        data_paths = list(data_dir.glob("*.parquet"))

        dst_dir = Path("output")
        dst_dir.mkdir(parents=True, exist_ok=True)
        for file_path in tqdm(data_paths):
            cp_symlink_file_to_dst(file_path, dst_dir)
    

    The output directory is as follows:

    output
    ├── train-00000-of-00084-26dbc51f3d0903b9.parquet
    ├── train-00001-of-00084-3efa94914043c815.parquet
    ├── ...
    ├── train-00083-of-00084-5e6bb79e23f90f3b.parquet
    └── validation-00000-of-00001-394e0bd4c5ebec42.parquet
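
    As a side note, the manual target reconstruction in cp_symlink_file_to_dst can probably be avoided: Path.resolve() follows the symlink to the cached blob no matter where the cache lives, and shutil.copy follows symlinks by default anyway. A minimal sketch of that variant, under the same assumptions as the script above:

    import shutil
    from pathlib import Path

    def cp_file_to_dst(file_path: Path, dst_dir: Path):
        # resolve() walks the symlink chain to the real blob in the cache
        real_file_path = file_path.resolve()
        # the symlink's own name is the original filename
        shutil.copy(real_file_path, Path(dst_dir) / file_path.name)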