Tags: python, dataset, google-colaboratory, kaggle

How to list all files from a Kaggle dataset in Google Colab?


I've been experimenting with the Kaggle API in Google Colab for a while now and I'm stuck on the following problem. I can authenticate my credentials without trouble, and I have no problem downloading whole datasets or specific files using:

!kaggle datasets download -d <user>/<dataset>
!kaggle datasets download <user>/<dataset> -f <specific_file>

However, I'm not able to get the list of all the files in a dataset (which I would like to save in a variable).

Whenever I'm using:

api.dataset_list_files('<user>/<dataset>').files

I get a list of blank entries, one for each file in the dataset. I couldn't find any mention of this behaviour online, so I guess it may be a recent bug. In addition, I can use:

!kaggle datasets files <user>/<dataset>

to correctly list the first 20 files, but that isn't very helpful, as I don't know how to see the rest or how to save the output in a variable.

I suppose I could come up with a complex solution that employs Selenium or something similar, but I think that would be overkill. That's why I'm asking here for the wisdom of more seasoned Kaggle API users, or of someone who has already faced and solved this problem. Could you help me, please?


Solution

  • I installed kaggle on my local computer and checked the command with the --help option:

    $ kaggle datasets files --help
    
    usage: kaggle datasets files [-h] [-v] [--page-token PAGE_TOKEN] [--page-size PAGE_SIZE] [dataset]
    
    options:
      -h, --help            show this help message and exit
      dataset               Dataset URL suffix in format <owner>/<dataset-name> (use "kaggle datasets list" to show options)
      -v, --csv             Print results in CSV format (if not set print in table format)
      --page-token PAGE_TOKEN
                            Page token for results paging.
      --page-size PAGE_SIZE
                            Number of items to show on a page. Default size is 20, max is 200
    

    The help shows that the default is 20 items per page,
    but you can use --page-size 200 to get up to 200 files at once.

    If the dataset has more files than fit on one page, the output shows a Next Page Token that you can pass to --page-token to load the next page.

    $ kaggle datasets files kaggle/meta-kaggle
    
    Next Page Token = CfDJ8CHCUm6ypKVLpjizcZHPE70-HT6X7bGt2XVG4i4n1JtDeW1lGdfJq1hmMK_AMcY_oT42rtD7r0_qjltw2PYh-F8
    name                                  size  creationDate         
    -----------------------------------  -----  -------------------  
    CompetitionTags.csv                   23KB  2024-06-18 11:28:01  
    Competitions.csv                       2MB  2024-06-18 11:28:02  
    DatasetTags.csv                        9MB  2024-06-18 11:28:02  
    DatasetTaskSubmissions.csv           648KB  2024-06-18 11:28:02  
    DatasetTasks.csv                       8MB  2024-06-18 11:28:02  
    DatasetVersions.csv                  933MB  2024-06-18 11:28:21  
    DatasetVotes.csv                      82MB  2024-06-18 11:28:09  
    Datasets.csv                          41MB  2024-06-18 11:28:09  
    Datasources.csv                       19MB  2024-06-18 11:28:08  
    EpisodeAgents.csv                     12GB  2024-06-18 11:33:22  
    Episodes.csv                           3GB  2024-06-18 11:30:10  
    ForumMessageVotes.csv                161MB  2024-06-18 11:29:37  
    ForumMessages.csv                      1GB  2024-06-18 11:29:51  
    ForumTopics.csv                       54MB  2024-06-18 11:29:36  
    Forums.csv                            15MB  2024-06-18 11:29:35  
    KernelLanguages.csv                   410B  2024-06-18 11:29:35  
    KernelTags.csv                        23MB  2024-06-18 11:29:35  
    KernelVersionCompetitionSources.csv   76MB  2024-06-18 11:29:36  
    KernelVersionDatasetSources.csv      260MB  2024-06-18 11:29:41  
    KernelVersionKernelSources.csv        26MB  2024-06-18 11:29:35  
    

    And the next page:

    $ kaggle datasets files kaggle/meta-kaggle --page-token 'CfDJ8CHCUm6ypKVLpjizcZHPE70-HT6X7bGt2XVG4i4n1JtDeW1lGdfJq1hmMK_AMcY_oT42rtD7r0_qjltw2PYh-F8'
    
    name                    size  creationDate         
    ---------------------  -----  -------------------  
    KernelVersions.csv       2GB  2024-06-18 11:30:01  
    KernelVotes.csv        208MB  2024-06-18 11:29:38  
    Kernels.csv            177MB  2024-06-18 11:29:38  
    Organizations.csv      286KB  2024-06-18 11:29:35  
    Submissions.csv          2GB  2024-06-18 11:29:59  
    Tags.csv                96KB  2024-06-18 11:29:35  
    TeamMemberships.csv    318MB  2024-06-18 11:29:39  
    Teams.csv              574MB  2024-06-18 11:29:43  
    UserAchievements.csv     5GB  2024-06-18 11:30:26  
    UserFollowers.csv       61MB  2024-06-18 11:29:36  
    UserOrganizations.csv   81KB  2024-06-18 11:29:35  
    Users.csv                1GB  2024-06-18 11:29:53  
    

    If there were more pages, the second page would show the token for the third page, the third page the token for the fourth, and so on.
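
The token chaining described above amounts to a simple loop: request a page, remember its token, and stop once no token comes back. Here is a minimal sketch of that loop. It assumes a hypothetical `fetch_page(token)` callable that returns a `(rows, next_token)` pair. That callable is not part of the kaggle package; it would have to wrap whichever CLI or API call you use. The fake pages and tokens below are stand-ins so the loop can run without network access.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical signature: takes the current page token (None for the first
# page) and returns the rows of that page plus the token of the next page.
FetchPage = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def iter_all_rows(fetch_page: FetchPage) -> Iterator[str]:
    """Yield rows from every page, following tokens until none is returned."""
    token: Optional[str] = None
    while True:
        rows, token = fetch_page(token)
        yield from rows
        if not token:  # no token means this was the last page
            break


# Stand-in for real requests: three fake pages chained by made-up tokens.
FAKE_PAGES = {
    None: (["a.csv", "b.csv"], "token-2"),
    "token-2": (["c.csv"], "token-3"),
    "token-3": (["d.csv"], None),
}

all_rows = list(iter_all_rows(lambda token: FAKE_PAGES[token]))
print(all_rows)
```

Keeping the loop separate from the fetching code means the same generator works whether the pages come from the command line or from the Python API.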


    If you need the output in a variable, you can assign it directly:

    variable = !kaggle datasets files kaggle/meta-kaggle 
    

    It can be more useful to request the output in CSV format:

    variable = !kaggle datasets files --csv kaggle/meta-kaggle 
    

    The result is a list of lines, so you can join it into one string
    and then use io to load it into a DataFrame:

    import pandas as pd
    import io
    
    text = "\n".join(variable)  # use `variable[1:]` to skip the line with the page token, or `variable[2:]` to also skip the header
    df = pd.read_csv(io.StringIO(text))
    
    print(df)
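
If pandas is not needed, the same captured lines can be parsed with the standard library. This is a minimal sketch with three assumptions. First, the token line starts with `Next Page Token = `, as in the table output above. Second, that line is also printed in CSV mode, as the comment in the snippet suggests. Third, the CSV header matches the table's column names. `parse_cli_output` is a hypothetical helper name, and `EXAMPLE_TOKEN` is a placeholder rather than a real token.

```python
import csv
from typing import Dict, List, Optional, Tuple

TOKEN_PREFIX = "Next Page Token = "  # assumed prefix, copied from the table output


def parse_cli_output(lines: List[str]) -> Tuple[Optional[str], List[Dict[str, str]]]:
    """Split captured CLI lines into (next page token, list of CSV rows)."""
    token = None
    csv_lines = []
    for line in lines:
        if line.startswith(TOKEN_PREFIX):
            token = line[len(TOKEN_PREFIX):].strip()
        elif line.strip():  # ignore empty lines
            csv_lines.append(line)
    rows = list(csv.DictReader(csv_lines))
    return token, rows


# Illustrative input; the header names are assumed to match the table columns.
sample = [
    "Next Page Token = EXAMPLE_TOKEN",
    "name,size,creationDate",
    "CompetitionTags.csv,23KB,2024-06-18 11:28:01",
    "Competitions.csv,2MB,2024-06-18 11:28:02",
]
token, rows = parse_cli_output(sample)
print(token, [row["name"] for row in rows])
```

Matching the token line by its prefix avoids relying on a fixed slice such as `[1:]`, which would drop a data row on the last page if no token line is printed there.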
    

    You can also redirect the output to a file:

    !kaggle datasets files --csv kaggle/meta-kaggle > output.csv
    

    and append the next page using >> instead of >:

    !kaggle datasets files --csv kaggle/meta-kaggle --page-token ... >> output.csv
    

    but this writes a second header as a row of data, which has to be removed later.
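
Removing the repeated header afterwards can be done with a few lines of Python. This is a minimal sketch using only built-ins. `drop_repeated_headers` is a hypothetical helper. It assumes the first line is the header and that any later line identical to it is a repeat. A "Next Page Token" line, if it ends up in the file, would need separate handling. The sample lines only illustrate what output.csv might contain after appending a second page.

```python
from typing import List


def drop_repeated_headers(lines: List[str]) -> List[str]:
    """Keep the first line as the header and drop later identical lines."""
    if not lines:
        return []
    header = lines[0]
    return [header] + [line for line in lines[1:] if line != header]


# Illustrative content of output.csv after appending a second page with `>>`.
appended = [
    "name,size,creationDate",
    "CompetitionTags.csv,23KB,2024-06-18 11:28:01",
    "name,size,creationDate",
    "KernelVersions.csv,2GB,2024-06-18 11:30:01",
]
cleaned = drop_repeated_headers(appended)
print("\n".join(cleaned))
```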


    Getting information with Python

    from kaggle.api.kaggle_api_extended import KaggleApi
    
    api = KaggleApi()
    api.authenticate()
    
    data = []
    
    # first page: there is no token yet
    page_token = None
    #page_size = 20
    
    while True:
        print('loading page ...')
    
        result = api.datasets_list_files('kaggle', 'meta-kaggle', page_token=page_token) #, page_size=page_size)
        # .get() returns None when the key is missing, so there is no need
        # to check first whether `nextPageToken` exists on the last page
        page_token = result.get('nextPageToken')
    
        # collect the name and the size in bytes of every file on this page
        for item in result['datasetFiles']:
            data.append([item['name'], item['totalBytes']])
    
        # no token means this was the last page
        if not page_token:
            break
    
    print('len(data):', len(data))
    for index, (name, size) in enumerate(data, 1):
        print(f'{index:3} | {size:15,} | {name}')
    

    Result:

    loading page ...
    loading page ...
    len(data): 32
      1 |          23,149 | CompetitionTags.csv
      2 |       2,601,077 | Competitions.csv
      3 |       9,251,082 | DatasetTags.csv
      4 |         663,759 | DatasetTaskSubmissions.csv
      5 |       7,918,950 | DatasetTasks.csv
      6 |     978,416,044 | DatasetVersions.csv
      7 |      85,737,991 | DatasetVotes.csv
      8 |      42,800,784 | Datasets.csv
      9 |      20,060,921 | Datasources.csv
     10 |  13,291,567,707 | EpisodeAgents.csv
     11 |   3,719,462,511 | Episodes.csv
     12 |     169,295,986 | ForumMessageVotes.csv
     13 |   1,365,573,428 | ForumMessages.csv
     14 |      57,075,670 | ForumTopics.csv
     15 |      15,698,393 | Forums.csv
     16 |             410 | KernelLanguages.csv
     17 |      24,133,789 | KernelTags.csv
     18 |      79,198,822 | KernelVersionCompetitionSources.csv
     19 |     272,485,582 | KernelVersionDatasetSources.csv
     20 |      27,251,969 | KernelVersionKernelSources.csv
     21 |   1,721,607,368 | KernelVersions.csv
     22 |     217,656,945 | KernelVotes.csv
     23 |     186,013,899 | Kernels.csv
     24 |         293,159 | Organizations.csv
     25 |   1,854,509,529 | Submissions.csv
     26 |          98,002 | Tags.csv
     27 |     333,276,608 | TeamMemberships.csv
     28 |     601,950,035 | Teams.csv
     29 |   5,596,699,151 | UserAchievements.csv
     30 |      63,464,540 | UserFollowers.csv
     31 |          83,301 | UserOrganizations.csv
     32 |   1,088,130,308 | Users.csv
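
The API returns sizes in bytes, while the CLI table shows rounded units. Here is a minimal sketch of a conversion that matches the CLI values listed above. It assumes 1024-based units and rounding to the nearest whole number. `human_size` is a hypothetical helper, not the function the Kaggle CLI itself uses.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count as a rounded size with a B/KB/MB/GB/TB suffix."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # stop at the first unit where the value drops below 1024
        if value < 1024 or unit == units[-1]:
            return f"{round(value)}{unit}"
        value /= 1024


# Byte counts taken from the result listing above.
for size in (410, 23_149, 978_416_044, 13_291_567_707):
    print(f"{size:15,} -> {human_size(size)}")
```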