python dataset google-colaboratory kaggle

How to list all files from a Kaggle dataset in Google Colab?

I've been experimenting with the Kaggle API in Google Colab for a while now and I'm stuck on the following problem. I can authenticate my credentials without trouble, and I have no problem downloading whole datasets or specific files using:

!kaggle datasets download -d <user>/<dataset>
!kaggle datasets download <user>/<dataset> -f <specific_file>

However, I'm not able to get the list of all the files in a dataset (which I would like to save in a variable).

Whenever I use:

api.dataset_list_files('<user>/<dataset>').files

I get a list of blank entries, as many as there are files in the dataset. I couldn't find any mention of this behaviour online, so I suspect it may be a recent bug. In addition, I can use:

!kaggle datasets files <user>/<dataset>

This correctly lists the first 20 files, but that isn't very helpful, because I don't know how to see the rest or how to save the output in a variable.

I suppose I could come up with a complex solution that employs Selenium or something similar, but I think that would be overkill. That's why I'm asking here, hoping for help from more seasoned Kaggle API users or from someone who has already faced and solved this problem. Could you help me, please?

Solution

I installed kaggle on my local computer and checked the command with the --help option

$ kaggle datasets files --help

usage: kaggle datasets files [-h] [-v] [--page-token PAGE_TOKEN] [--page-size PAGE_SIZE] [dataset]

options:
  -h, --help            show this help message and exit
  dataset               Dataset URL suffix in format <owner>/<dataset-name> (use "kaggle datasets list" to show options)
  -v, --csv             Print results in CSV format (if not set print in table format)
  --page-token PAGE_TOKEN
                        Page token for results paging.
  --page-size PAGE_SIZE
                        Number of items to show on a page. Default size is 20, max is 200

It shows that the default is 20 items per page, but you can use --page-size 200 to get up to 200 files at once.

If the dataset has more files than fit on one page, the output starts with a Next Page Token that you can pass to --page-token to load the next page.

$ kaggle datasets files kaggle/meta-kaggle

Next Page Token = CfDJ8CHCUm6ypKVLpjizcZHPE70-HT6X7bGt2XVG4i4n1JtDeW1lGdfJq1hmMK_AMcY_oT42rtD7r0_qjltw2PYh-F8
name                                  size  creationDate         
-----------------------------------  -----  -------------------  
CompetitionTags.csv                   23KB  2024-06-18 11:28:01  
Competitions.csv                       2MB  2024-06-18 11:28:02  
DatasetTags.csv                        9MB  2024-06-18 11:28:02  
DatasetTaskSubmissions.csv           648KB  2024-06-18 11:28:02  
DatasetTasks.csv                       8MB  2024-06-18 11:28:02  
DatasetVersions.csv                  933MB  2024-06-18 11:28:21  
DatasetVotes.csv                      82MB  2024-06-18 11:28:09  
Datasets.csv                          41MB  2024-06-18 11:28:09  
Datasources.csv                       19MB  2024-06-18 11:28:08  
EpisodeAgents.csv                     12GB  2024-06-18 11:33:22  
Episodes.csv                           3GB  2024-06-18 11:30:10  
ForumMessageVotes.csv                161MB  2024-06-18 11:29:37  
ForumMessages.csv                      1GB  2024-06-18 11:29:51  
ForumTopics.csv                       54MB  2024-06-18 11:29:36  
Forums.csv                            15MB  2024-06-18 11:29:35  
KernelLanguages.csv                   410B  2024-06-18 11:29:35  
KernelTags.csv                        23MB  2024-06-18 11:29:35  
KernelVersionCompetitionSources.csv   76MB  2024-06-18 11:29:36  
KernelVersionDatasetSources.csv      260MB  2024-06-18 11:29:41  
KernelVersionKernelSources.csv        26MB  2024-06-18 11:29:35

And the next page:

$ kaggle datasets files kaggle/meta-kaggle --page-token 'CfDJ8CHCUm6ypKVLpjizcZHPE70-HT6X7bGt2XVG4i4n1JtDeW1lGdfJq1hmMK_AMcY_oT42rtD7r0_qjltw2PYh-F8'

name                    size  creationDate         
---------------------  -----  -------------------  
KernelVersions.csv       2GB  2024-06-18 11:30:01  
KernelVotes.csv        208MB  2024-06-18 11:29:38  
Kernels.csv            177MB  2024-06-18 11:29:38  
Organizations.csv      286KB  2024-06-18 11:29:35  
Submissions.csv          2GB  2024-06-18 11:29:59  
Tags.csv                96KB  2024-06-18 11:29:35  
TeamMemberships.csv    318MB  2024-06-18 11:29:39  
Teams.csv              574MB  2024-06-18 11:29:43  
UserAchievements.csv     5GB  2024-06-18 11:30:26  
UserFollowers.csv       61MB  2024-06-18 11:29:36  
UserOrganizations.csv   81KB  2024-06-18 11:29:35  
Users.csv                1GB  2024-06-18 11:29:53

If there were more pages, the second page would show a token for the third page, the third page a token for the fourth page, and so on.
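The token chaining described above can be automated. The following is a minimal sketch, not part of the Kaggle package: the helper names (`build_command`, `split_page`, `run_cli`, `list_all_pages`) are hypothetical, and it assumes the CLI prints the `Next Page Token = ...` line in CSV mode in the same form as in the table output shown above. The flags used are the ones listed by `--help`.

```python
import re
import subprocess

# Matches the token line as it appears in the sample output above.
TOKEN_RE = re.compile(r"^Next Page Token = (\S+)\s*$")


def build_command(dataset, page_token=None, page_size=200):
    """Build the argument list for one `kaggle datasets files` call."""
    cmd = ["kaggle", "datasets", "files", dataset,
           "--csv", "--page-size", str(page_size)]
    if page_token:
        cmd += ["--page-token", page_token]
    return cmd


def split_page(lines):
    """Separate the optional token line from the CSV lines of one page."""
    token = None
    rows = []
    for line in lines:
        match = TOKEN_RE.match(line)
        if match:
            token = match.group(1)
        elif line.strip():
            rows.append(line)
    return token, rows


def run_cli(cmd):
    """Run the CLI and return stdout as a list of lines.

    Requires the kaggle CLI to be installed and authenticated.
    """
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout.splitlines()


def list_all_pages(dataset, run=run_cli):
    """Follow page tokens until none is printed.

    Returns the CSV header followed by the data rows of every page.
    """
    all_rows = []
    token = None
    while True:
        token, rows = split_page(run(build_command(dataset, token)))
        # Keep the header only from the first page.
        all_rows += rows if not all_rows else rows[1:]
        if not token:
            return all_rows
```

With the CLI configured, `rows = list_all_pages('kaggle/meta-kaggle')` should give one list covering every page. The `run` parameter exists only so the loop can be tried with canned output instead of a real call.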

If you need it in a variable, you can assign the output directly (this is IPython/Colab syntax)

variable = !kaggle datasets files kaggle/meta-kaggle

It can be more useful to request the output in CSV format

variable = !kaggle datasets files --csv kaggle/meta-kaggle

because it gives you a list of lines, which you can join into one string
and then load into a DataFrame using io.StringIO

import pandas as pd
import io

text = "\n".join(variable)  # use `variable[1:]` to skip the "Next Page Token" line, or `variable[2:]` to skip the header as well
df = pd.read_csv(io.StringIO(text))

print(df)

You can also redirect the output to a file

!kaggle datasets files --csv kaggle/meta-kaggle > output.csv

and append the next page using >> instead of >

!kaggle datasets files --csv kaggle/meta-kaggle --page-token ... >> output.csv

but this writes a second header into the file as a row of data, which you need to remove later.
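Cleaning up such an appended file could look like the sketch below. It uses only the standard library, the function name `parse_appended_pages` is hypothetical, and it assumes the CSV header matches the column names of the table output above (`name,size,creationDate`) and that any `Next Page Token = ...` lines were redirected into the file along with the data.

```python
import csv
import io


def parse_appended_pages(text):
    """Parse CSV text made of several appended pages.

    Drops blank lines, "Next Page Token" lines, and every repeat of the
    header, then returns the remaining rows as a list of dicts.
    """
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("Next Page Token")
    ]
    if not lines:
        return []
    header = lines[0]
    # Any later line identical to the header is a repeat from another page.
    body = [line for line in lines[1:] if line != header]
    return list(csv.DictReader(io.StringIO("\n".join([header] + body))))
```

For example, `rows = parse_appended_pages(open('output.csv').read())` gives one dict per file, and `[row['name'] for row in rows]` gives just the file names.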

Getting information with Python

import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

data = []

# first page
page_token = None
#page_size = 20

while True:
    print('loading page ...')

    result = api.datasets_list_files('kaggle', 'meta-kaggle', page_token=page_token) #, page_size=page_size)
    #print('-- result keys ---')
    #print('\n'.join(sorted(result.keys())))
    # use .get() instead of result['nextPageToken'] in case the key is missing when there is no next page
    page_token = result.get('nextPageToken')

    for item in result['datasetFiles']:
        #print('-- item keys ---')
        #print('\n'.join(sorted(item.keys())))
        data.append( [item['name'], item['totalBytes']] )

    if not page_token:
        break

print('len(data):', len(data))
for index, (name, size) in enumerate(data, 1):
    print(f'{index:3} | {size:15,} | {name}')

Result:

loading page ...
loading page ...
len(data): 32
  1 |          23,149 | CompetitionTags.csv
  2 |       2,601,077 | Competitions.csv
  3 |       9,251,082 | DatasetTags.csv
  4 |         663,759 | DatasetTaskSubmissions.csv
  5 |       7,918,950 | DatasetTasks.csv
  6 |     978,416,044 | DatasetVersions.csv
  7 |      85,737,991 | DatasetVotes.csv
  8 |      42,800,784 | Datasets.csv
  9 |      20,060,921 | Datasources.csv
 10 |  13,291,567,707 | EpisodeAgents.csv
 11 |   3,719,462,511 | Episodes.csv
 12 |     169,295,986 | ForumMessageVotes.csv
 13 |   1,365,573,428 | ForumMessages.csv
 14 |      57,075,670 | ForumTopics.csv
 15 |      15,698,393 | Forums.csv
 16 |             410 | KernelLanguages.csv
 17 |      24,133,789 | KernelTags.csv
 18 |      79,198,822 | KernelVersionCompetitionSources.csv
 19 |     272,485,582 | KernelVersionDatasetSources.csv
 20 |      27,251,969 | KernelVersionKernelSources.csv
 21 |   1,721,607,368 | KernelVersions.csv
 22 |     217,656,945 | KernelVotes.csv
 23 |     186,013,899 | Kernels.csv
 24 |         293,159 | Organizations.csv
 25 |   1,854,509,529 | Submissions.csv
 26 |          98,002 | Tags.csv
 27 |     333,276,608 | TeamMemberships.csv
 28 |     601,950,035 | Teams.csv
 29 |   5,596,699,151 | UserAchievements.csv
 30 |      63,464,540 | UserFollowers.csv
 31 |          83,301 | UserOrganizations.csv
 32 |   1,088,130,308 | Users.csv
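To keep the full list in a variable and reuse the loop for other datasets, the pagination can be wrapped in a function. This is a minimal sketch with a hypothetical name (`collect_files`). It takes the page-fetching call as a parameter and assumes each response is a dict with the `datasetFiles` and `nextPageToken` keys used in the code above.

```python
def collect_files(fetch_page):
    """Collect (name, total_bytes) pairs from every page.

    `fetch_page` is any callable that takes a page token (None for the
    first page) and returns a dict shaped like the responses above.
    """
    files = []
    page_token = None
    while True:
        result = fetch_page(page_token)
        files.extend(
            (item['name'], item['totalBytes'])
            for item in result['datasetFiles']
        )
        page_token = result.get('nextPageToken')
        if not page_token:
            return files


# Stand-in for the real API call so the sketch runs without credentials.
_fake_pages = {
    None: {'datasetFiles': [{'name': 'a.csv', 'totalBytes': 410}],
           'nextPageToken': 'token-2'},
    'token-2': {'datasetFiles': [{'name': 'b.csv', 'totalBytes': 23149}]},
}

demo_files = collect_files(lambda token: _fake_pages[token])
```

With the authenticated `api` object from above, the real call would be `collect_files(lambda token: api.datasets_list_files('kaggle', 'meta-kaggle', page_token=token))`.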