I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows:
pip install huggingface_hub[hf_transfer]
huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --local-dir-use-symlinks False
However, the downloaded files don't have their original filenames. Instead, their hashes are used as filenames (the git blob SHA-1 for regular files, or the SHA-256 for LFS files):
--- /home/dernonco/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/blobs ---------------------------------------------
/..
12.9 GiB [##########] b581945ddee5e673fa2059afb25274b1523f270687b5253cb8aa72865760ebc0
3.9 GiB [### ] 86ebd2861a42b27168d75f346dd72f0e2b9eaee0afb90890beff15d025af45c6
3.9 GiB [## ] f9b81739ee30450b930390e1155e2cdea1b3063379ba6fd9253513eba1ab1e05
3.7 GiB [## ] e54c7d123ad93f4144eebdca2827ef81ea1ac282ddd2243386528cd157c02f36
3.7 GiB [## ] 736e225a7dd38a7987d0745b1b2f545ab701cfdf1f639874f5743b5bfb5cb1e1
3.7 GiB [## ] 0687246c92ec87b54e1c5fe623a77b650c02e6884e17a6f0fb4052a862d928d0
3.6 GiB [## ] 2becb5f9878b95f1b12622f50868f5855221985f05910d7cc759e6be074e6b8e
3.5 GiB [## ] 2208068c69b39c46ee9fac862da3c060c58b61adcaee1b3e6aa5d6d5dd3eba86
3.5 GiB [## ] caf87e71232cbb8a31960a26ba30b9412c15893c831ef118196c581cfd3a3779
3.4 GiB [## ] dc88cbf0ef45351bdc1f53c4396466d3e79874803719e266630ed6c3ad911d6a
3.4 GiB [## ] f05f7fb3b55b6840ebc4ada5daa28742bbae6ad4dcc35781dc811024f27a1b4e
3.4 GiB [## ] 88bd831618b36330ef5cd84b7ccbc4d5f3f55955c0b223208bc2244b27fb2d78
3.4 GiB [## ] bf80943b3389ddbeb8fb8a56af2d7fa5d09c5af076aac93f54ad921ee382c77d
3.3 GiB [## ] 83b2627e644c9ad0486e3bd966b02f014722e668d26b9d52394c974fcf2fdcf8
3.2 GiB [## ] e52e7b086dabd431b25cf309e1fe513190543e058f4e7a2d8e05b22821ded4fe
3.2 GiB [## ] 4fe583348f3ac118f34c7b93b6a187ba4e21a5a7f5b6ca1a6adbce1cc6d563a9
3.2 GiB [## ] ae6b6faca3bbd75e7ca99ccf20b55b017393bf09022efb8459293afffe06dc6e
3.1 GiB [## ] 5865379a894f8dc40703bdc1093d45fda67d5e1a742a2eebddd37e1a00f067fd
3.1 GiB [## ] cd346324b29390a589926ccab7187ae818cf5f9fcbaf8ecc95313e6cdfab86bc
3.0 GiB [## ] 914eb2b1174a662e3faebac82f6b5591a54def39a9d3a7e5ab2347ecc87a982f
2.9 GiB [## ] 24789f33332e8539b2ee72a0a489c0f4d0c6103f7f9600de660d78543ade9111
2.9 GiB [## ] 35e8da5f831b36416c9569014c58f881a0a30c00db9f3caae0d7db6a8fd3c694
2.8 GiB [## ] d5127e0298661d40a343d58759ed6298f9d2ef02d5c4f6a30bd9e07bc5423317
2.8 GiB [## ] 1b4e1951da2462ca77d94d220a58c97f64caa2b2defe4df95feed9defcee6ca7
2.8 GiB [## ] 75a4725625c095d98ecef7d68d384d7b1201ace046ef02ed499776b0ac02b61e
2.8 GiB [## ] fefbbc3e87be522b7e571c78a188aba35bd5d282cf8f41257097a621af64ff60
Total disk usage: 184.8 GiB Apparent size: 184.8 GiB Items: 85
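The original filenames don't seem to be lost entirely: next to blobs there is a snapshots directory which, as far as I can tell, contains symlinks named after the original files and pointing back at these blobs, roughly like this (the revision and filenames below are placeholders):
/home/dernonco/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots/<revision>
├── <original_filename_1> -> ../../blobs/b581945ddee5e673...
├── <original_filename_2> -> ../../blobs/86ebd2861a42b271...
└── ...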
How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?
I ran into the same problem and wrote a Python script to handle it.
For example, I downloaded the naver-clova-ix/synthdog-en dataset with:
$ huggingface-cli download --repo-type dataset --resume-download naver-clova-ix/synthdog-en --local-dir synthdog-en
The synthdog-en directory structure is as follows:
synthdog-en
├── README.md
├── data
│   ├── train-00000-of-00084-26dbc51f3d0903b9.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/9d0260e08cb5a4f9c14fa794465bcb66fae6ef7ccc2f6d7ef20efa44810c0648
│   ├── train-00001-of-00084-3efa94914043c815.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/04441e203ff713743c0c9a1009f71f97e47bc4d7b2c9313f4fcfa9c3e73b20e3
│   ├── ...
│   └── validation-00000-of-00001-394e0bd4c5ebec42.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/4e5f27b7a976041855d80eb07680de4ea014be07a494f40b246058dfce46d44b
└── dataset_infos.json
The full Python script is as follows:
import shutil
from pathlib import Path

from tqdm import tqdm


def cp_symlink_file_to_dst(file_path: Path, dst_dir: Path):
    """Copy the blob behind a cache symlink into dst_dir under the symlink's (original) name."""
    if not file_path.is_symlink():
        return
    # Path.readlink() requires Python 3.9+; it returns the raw (relative) target of the symlink.
    real_file_path = file_path.readlink()
    # Rebuild an absolute path by keeping everything after the last "../".
    # This assumes the cache is in its default location under the home directory.
    real_file_path = Path.home() / str(real_file_path).rpartition("../")[-1]
    # The symlink's own name is the original filename.
    real_file_name = file_path.name
    dst_file_path = dst_dir / real_file_name
    shutil.copy(real_file_path, dst_file_path)


if __name__ == "__main__":
    # Run this from inside the downloaded dataset directory (e.g. synthdog-en).
    data_dir = Path("data")
    data_paths = list(data_dir.glob("*.parquet"))

    dst_dir = Path("output")
    dst_dir.mkdir(parents=True, exist_ok=True)

    for file_path in tqdm(data_paths):
        cp_symlink_file_to_dst(file_path, dst_dir)
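To use it, save the script inside the downloaded dataset directory (the name copy_real_files.py is just an example) and run it from there:
$ cd synthdog-en
$ python copy_real_files.py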
The output directory is as follows:
output
├── train-00000-of-00084-26dbc51f3d0903b9.parquet
├── train-00001-of-00084-3efa94914043c815.parquet
├── ...
├── train-00083-of-00084-5e6bb79e23f90f3b.parquet
└── validation-00000-of-00001-394e0bd4c5ebec42.parquet
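For what it's worth, the CLI can also place real files with their original names directly in a target directory, which avoids the copy step entirely. I believe the equivalent command would be something like the following, though the behaviour of --local-dir-use-symlinks varies between huggingface_hub versions:
$ huggingface-cli download --repo-type dataset naver-clova-ix/synthdog-en --local-dir synthdog-en --local-dir-use-symlinks False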