I've been experimenting with the Kaggle API in Google Colab for a while now and I'm stuck on the following problem. I can authenticate with my credentials without trouble, and I have no problem downloading whole datasets or specific files using:
!kaggle datasets download -d <user>/<dataset>
!kaggle datasets download <user>/<dataset> -f <specific_file>
However, I'm not able to get the list of all the files in a dataset (which I would like to save in a variable).
Whenever I use:
api.dataset_list_files('<user>/<dataset>').files
I get a list of blank entries, one for each file in the dataset. I couldn't find any mention of this on the internet, so I guess it may be a recent bug. In addition, I can use:
!kaggle datasets files <user>/<dataset>
to correctly list the first 20 files, but that isn't very helpful, as I don't know how to see the rest or how to save the list in a variable.
I suppose I could come up with a complex solution that uses Selenium or something similar, but I think that would be overkill. That's why I'm asking here, hoping for help from more seasoned Kaggle API users or from someone who has already faced and solved this problem. Could you help me, please?
I installed kaggle on my local computer and checked the command with the --help option:
$ kaggle datasets files --help
usage: kaggle datasets files [-h] [-v] [--page-token PAGE_TOKEN] [--page-size PAGE_SIZE] [dataset]
options:
-h, --help show this help message and exit
dataset Dataset URL suffix in format <owner>/<dataset-name> (use "kaggle datasets list" to show options)
-v, --csv Print results in CSV format (if not set print in table format)
--page-token PAGE_TOKEN
Page token for results paging.
--page-size PAGE_SIZE
Number of items to show on a page. Default size is 20, max is 200
It shows that the default is 20 items per page, but you can use --page-size 200 to get up to 200 files at once.
If the dataset has more files than fit on one page, the output starts with a Next Page Token line, and you pass that value to --page-token to load the next page.
$ kaggle datasets files kaggle/meta-kaggle
Next Page Token = CfDJ8CHCUm6ypKVLpjizcZHPE70-HT6X7bGt2XVG4i4n1JtDeW1lGdfJq1hmMK_AMcY_oT42rtD7r0_qjltw2PYh-F8
name size creationDate
----------------------------------- ----- -------------------
CompetitionTags.csv 23KB 2024-06-18 11:28:01
Competitions.csv 2MB 2024-06-18 11:28:02
DatasetTags.csv 9MB 2024-06-18 11:28:02
DatasetTaskSubmissions.csv 648KB 2024-06-18 11:28:02
DatasetTasks.csv 8MB 2024-06-18 11:28:02
DatasetVersions.csv 933MB 2024-06-18 11:28:21
DatasetVotes.csv 82MB 2024-06-18 11:28:09
Datasets.csv 41MB 2024-06-18 11:28:09
Datasources.csv 19MB 2024-06-18 11:28:08
EpisodeAgents.csv 12GB 2024-06-18 11:33:22
Episodes.csv 3GB 2024-06-18 11:30:10
ForumMessageVotes.csv 161MB 2024-06-18 11:29:37
ForumMessages.csv 1GB 2024-06-18 11:29:51
ForumTopics.csv 54MB 2024-06-18 11:29:36
Forums.csv 15MB 2024-06-18 11:29:35
KernelLanguages.csv 410B 2024-06-18 11:29:35
KernelTags.csv 23MB 2024-06-18 11:29:35
KernelVersionCompetitionSources.csv 76MB 2024-06-18 11:29:36
KernelVersionDatasetSources.csv 260MB 2024-06-18 11:29:41
KernelVersionKernelSources.csv 26MB 2024-06-18 11:29:35
And the next page:
$ kaggle datasets files kaggle/meta-kaggle --page-token 'CfDJ8CHCUm6ypKVLpjizcZHPE70-HT6X7bGt2XVG4i4n1JtDeW1lGdfJq1hmMK_AMcY_oT42rtD7r0_qjltw2PYh-F8'
name size creationDate
--------------------- ----- -------------------
KernelVersions.csv 2GB 2024-06-18 11:30:01
KernelVotes.csv 208MB 2024-06-18 11:29:38
Kernels.csv 177MB 2024-06-18 11:29:38
Organizations.csv 286KB 2024-06-18 11:29:35
Submissions.csv 2GB 2024-06-18 11:29:59
Tags.csv 96KB 2024-06-18 11:29:35
TeamMemberships.csv 318MB 2024-06-18 11:29:39
Teams.csv 574MB 2024-06-18 11:29:43
UserAchievements.csv 5GB 2024-06-18 11:30:26
UserFollowers.csv 61MB 2024-06-18 11:29:36
UserOrganizations.csv 81KB 2024-06-18 11:29:35
Users.csv 1GB 2024-06-18 11:29:53
If there were more pages, the second page would show a token for the third page, the third page would show a token for the fourth page, and so on.
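The token chaining described above can be sketched as a small loop. This is only an illustration under stated assumptions: `fetch_all`, `fake_fetch_page` and `FAKE_PAGES` are hypothetical names, and the fake pager stands in for a real call to the CLI or the API so that the example runs without Kaggle credentials.

```python
from typing import Callable, List, Optional, Tuple

# One page = (items on this page, token for the next page or None).
Page = Tuple[List[str], Optional[str]]


def fetch_all(fetch_page: Callable[[Optional[str]], Page]) -> List[str]:
    """Keep requesting pages until a page comes back without a next token."""
    items: List[str] = []
    token: Optional[str] = None  # None means "first page"
    while True:
        page_items, token = fetch_page(token)
        items.extend(page_items)
        if not token:
            return items


# Hypothetical in-memory pager: each token maps to a page and the next token.
FAKE_PAGES = {
    None: (["a.csv", "b.csv"], "token-2"),
    "token-2": (["c.csv", "d.csv"], "token-3"),
    "token-3": (["e.csv"], None),
}


def fake_fetch_page(token: Optional[str]) -> Page:
    return FAKE_PAGES[token]


all_files = fetch_all(fake_fetch_page)
print(all_files)
```

To use it for real, you would replace `fake_fetch_page` with a function that runs one request and returns the file names together with the token reported for the next page.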
If you need it in a variable, you can assign the output directly in a notebook cell:
variable = !kaggle datasets files kaggle/meta-kaggle
It can be more useful to request the output in CSV format:
variable = !kaggle datasets files --csv kaggle/meta-kaggle
because the result is a list of lines, which you can join into one string and then load into a DataFrame using io.StringIO:
import pandas as pd
import io
text = "\n".join(variable)  # use `variable[1:]` to skip the line with the page token, or `variable[2:]` to also skip the header
df = pd.read_csv(io.StringIO(text))
print(df)
You can also redirect the output to a file:
!kaggle datasets files --csv kaggle/meta-kaggle > output.csv
and append the next page using >> instead of >:
!kaggle datasets files --csv kaggle/meta-kaggle --page-token ... >> output.csv
but this adds a second header line as a row of data, which you need to remove later.
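One way to deal with the repeated header and the token line is to clean the pages in Python before combining them. The following is a minimal sketch using only the standard library; `merge_csv_pages` is a hypothetical helper, and the sample lines are hand-written to imitate the shape of the CSV output (a `Next Page Token = ...` line, then a header, then rows), so the exact format is an assumption.

```python
import csv
import io

TOKEN_PREFIX = "Next Page Token"  # assumed prefix of the paging line


def merge_csv_pages(pages):
    """Combine several pages of CSV lines into one header and one list of rows.

    `pages` is a list of pages, each page being a list of text lines.
    Token lines and blank lines are dropped, and only the first header is kept.
    """
    header = None
    rows = []
    for lines in pages:
        csv_lines = [
            line for line in lines
            if line.strip() and not line.startswith(TOKEN_PREFIX)
        ]
        reader = csv.reader(io.StringIO("\n".join(csv_lines)))
        page_header = next(reader, None)
        if page_header is None:
            continue  # empty page, nothing to add
        if header is None:
            header = page_header
        rows.extend(reader)  # remaining lines are data rows
    return header, rows


# Illustrative sample data, not real command output.
page1 = [
    "Next Page Token = <token>",
    "name,size,creationDate",
    "CompetitionTags.csv,23KB,2024-06-18 11:28:01",
    "Competitions.csv,2MB,2024-06-18 11:28:02",
]
page2 = [
    "name,size,creationDate",
    "KernelVersions.csv,2GB,2024-06-18 11:30:01",
]

header, rows = merge_csv_pages([page1, page2])
print(header)
for row in rows:
    print(row)
```

In a notebook, each page would be the list of lines returned by one `variable = !kaggle datasets files --csv ...` call.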
Getting information with Python
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
data = []
# first page
page_token = None
#page_size = 20
while True:
    print('loading page ...')
    result = api.datasets_list_files('kaggle', 'meta-kaggle', page_token=page_token)  # , page_size=page_size)
    #print('-- result keys ---')
    #print('\n'.join(sorted(result.keys())))
    # use .get() because I'm not sure the key exists when there is no next page,
    # and I don't want to check the key `hasNextPageToken`
    #page_token = result['nextPageToken']
    page_token = result.get('nextPageToken')
    for item in result['datasetFiles']:
        #print('-- item keys ---')
        #print('\n'.join(sorted(item.keys())))
        data.append([item['name'], item['totalBytes']])
    # stop when there is no token for a next page
    if not page_token:
        break
print('len(data):', len(data))
for index, (name, size) in enumerate(data, 1):
    print(f'{index:3} | {size:15,} | {name}')
Result:
loading page ...
loading page ...
len(data): 32
1 | 23,149 | CompetitionTags.csv
2 | 2,601,077 | Competitions.csv
3 | 9,251,082 | DatasetTags.csv
4 | 663,759 | DatasetTaskSubmissions.csv
5 | 7,918,950 | DatasetTasks.csv
6 | 978,416,044 | DatasetVersions.csv
7 | 85,737,991 | DatasetVotes.csv
8 | 42,800,784 | Datasets.csv
9 | 20,060,921 | Datasources.csv
10 | 13,291,567,707 | EpisodeAgents.csv
11 | 3,719,462,511 | Episodes.csv
12 | 169,295,986 | ForumMessageVotes.csv
13 | 1,365,573,428 | ForumMessages.csv
14 | 57,075,670 | ForumTopics.csv
15 | 15,698,393 | Forums.csv
16 | 410 | KernelLanguages.csv
17 | 24,133,789 | KernelTags.csv
18 | 79,198,822 | KernelVersionCompetitionSources.csv
19 | 272,485,582 | KernelVersionDatasetSources.csv
20 | 27,251,969 | KernelVersionKernelSources.csv
21 | 1,721,607,368 | KernelVersions.csv
22 | 217,656,945 | KernelVotes.csv
23 | 186,013,899 | Kernels.csv
24 | 293,159 | Organizations.csv
25 | 1,854,509,529 | Submissions.csv
26 | 98,002 | Tags.csv
27 | 333,276,608 | TeamMemberships.csv
28 | 601,950,035 | Teams.csv
29 | 5,596,699,151 | UserAchievements.csv
30 | 63,464,540 | UserFollowers.csv
31 | 83,301 | UserOrganizations.csv
32 | 1,088,130,308 | Users.csv
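Note that the Python result gives exact byte counts (for example 23,149), while the command-line table shows rounded sizes (23KB). If you want the shorter form in Python, here is a minimal sketch. It assumes 1024-based units rounded to the nearest whole number, which agrees with the values shown above but is not taken from the kaggle source code; `human_size` is a hypothetical helper name.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count with 1024-based units, rounded to a whole number."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{round(value)}{unit}"
        value /= 1024


# Byte counts taken from the listing above.
for name, size in [
    ("CompetitionTags.csv", 23149),
    ("Competitions.csv", 2601077),
    ("EpisodeAgents.csv", 13291567707),
    ("KernelLanguages.csv", 410),
]:
    print(f"{human_size(size):>6} | {name}")
```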