
How to add kaggle dataset download stats in github readme file?


I am trying to add a badge-style indicator to my GitHub README that shows how many times my dataset has been downloaded from Kaggle so far (much like a page-visit counter). Is there any way to do this? In particular, I want to display the download count (currently 602), and the counter should update automatically, in real time, as new downloads occur.

I didn't find any specific badge to integrate into the README (from shields.io or elsewhere).


Solution

  • The simplest way to do it would be to implement your own web scraper and use a github workflow to periodically scrape the required data.

    Crude Python implementation

    Create requirements.txt file

    In the root directory of your github repository, create a requirements.txt file with the following:

    selenium==4.6.0
    

    The selenium web scraper will be used to fetch data from the kaggle website.

    Create script

    In the root directory of your github repository, create a badge_generator.py file with the following:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time
    from selenium.webdriver.common.by import By
    
    
    def update_readme(readme_file_path, badge_id, new_badge):
        new_file_content = ''
        # id used to identify position of badge
        line_id = f'![kaggle-badge-{badge_id}]'
        badge_found = False
    
        # open readme and update badge
        with open(readme_file_path, 'r', encoding='utf-8') as f:
            # get all lines in readme
            lines = f.readlines()
        for i, line in enumerate(lines):
            if line_id in line:
                # replace old badge with new badge
                lines[i] = new_badge
                badge_found = True
                break
        # concatenate lines
        new_file_content = ''.join(lines) if lines else new_badge
    
        if not badge_found:
            raise Exception(
                f"Badge {badge_id} not found in {readme_file_path}")
        # update readme
        with open(readme_file_path, 'w', encoding='utf-8') as f:
            f.write(new_file_content)
    
    
    def create_badge(badge_id, badge_value,
                     badge_name='Downloads', badge_color='orange'):
        badge_url = (f'https://img.shields.io/badge/{badge_name}'
                     f'-{badge_value}-{badge_color}')
        markdown = (f'![kaggle-badge-{badge_id}]({badge_url})\n')
        return markdown
    
    
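    def get_download_count(kaggle_url: str) -> str:
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--headless')
        driver = webdriver.Chrome(options=chrome_options)
        try:
            driver.get(kaggle_url)
            # crude fixed wait for the page's JavaScript to render
            time.sleep(3)
            # absolute XPath copied from the browser; it will break
            # whenever kaggle changes its page layout
            downloads_element = driver.find_element(
                By.XPATH,
                '//*[@id="site-content"]/div[2]/div/div[5]/div[6]/div[2]/div[1]/div/div[3]/h1')
            download_count = downloads_element.get_attribute("textContent")
            # strip surrounding whitespace so it does not end up in the badge URL
            return download_count.strip()
        finally:
            # always close the browser, even if the element was not found
            driver.quit()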
    
    
    def main():
        readme_file_path = "README.md"  # relative to root directory
        # change this url
        url = 'https://www.kaggle.com/datasets/utkarshx27/marijuana-arrests-in-columbia'
        badge_id = 1  # each badge must be given a unique id
        download_count = get_download_count(url)
        badge = create_badge(badge_id, download_count)
        update_readme(readme_file_path, badge_id, badge)
    
    
    if __name__ == '__main__':
        main()
    

    Replace the value of url in the main function with the URL of your Kaggle dataset page.
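
One caveat with create_badge: shields.io static badge URLs use "-" to separate the label, the value, and the color, so a label or value that itself contains a dash, underscore, or space has to be escaped. A minimal sketch, assuming the escaping rules shields.io documents for static badges ("--" for a dash, "__" for an underscore, "_" for a space). The helpers escape_badge_text and build_badge_url are hypothetical and are not part of the script above:

```python
from urllib.parse import quote


def escape_badge_text(text: str) -> str:
    """Escape text for one segment of a shields.io static badge URL."""
    # Double existing dashes and underscores first, then encode spaces,
    # so the underscores introduced for spaces are not doubled again.
    escaped = text.replace('-', '--').replace('_', '__').replace(' ', '_')
    # Percent-encode anything else that is unsafe in a URL path (e.g. ',').
    return quote(escaped, safe='')


def build_badge_url(name: str, value: str, color: str = 'orange') -> str:
    """Build a static badge URL of the form /badge/<name>-<value>-<color>."""
    return ('https://img.shields.io/badge/'
            f'{escape_badge_text(name)}-{escape_badge_text(value)}-{color}')


if __name__ == '__main__':
    print(build_badge_url('Kaggle downloads', '1,234'))
```

For a plain label and a purely numeric count, as in the script above, the escaped URL is identical to the unescaped one, so this only matters once you customize the badge text.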

    README modifications

    Your README.md file must be in the root directory. In it, add the following line where you want the badge to appear:

    ![kaggle-badge-1]()
    

    This line must be present before the script is run. When the script runs, it overwrites this line with the updated badge.

    Do not write anything else on this line.
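
To see why the placeholder needs its own line, the replacement step can be tried on an in-memory README first. The sketch below mirrors the matching logic of update_readme (find the first line containing the ![kaggle-badge-<id>] marker and replace the whole line). The helper replace_badge_line is hypothetical, written only for illustration:

```python
def replace_badge_line(readme_text: str, badge_id: int, new_badge: str) -> str:
    """Return readme_text with the first badge placeholder line replaced."""
    marker = f'![kaggle-badge-{badge_id}]'
    lines = readme_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if marker in line:
            # The whole line is overwritten, which is why nothing else
            # should share a line with the placeholder.
            lines[i] = new_badge
            return ''.join(lines)
    raise ValueError(f'Badge {badge_id} not found')


if __name__ == '__main__':
    readme = '# My dataset\n![kaggle-badge-1]()\nSome description.\n'
    badge = ('![kaggle-badge-1]'
             '(https://img.shields.io/badge/Downloads-602-orange)\n')
    print(replace_badge_line(readme, 1, badge))
```

Because the generated badge keeps the ![kaggle-badge-1] alt text, the next run finds the same line again and replaces it with the fresh count.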

    Create github workflow

    Create a .github folder in the root directory of your GitHub repository, and inside it create another folder named workflows. Place badge.yml inside workflows:

    name: Kaggle Badge Generator
    
    on:
      push:
      workflow_dispatch:
      schedule:
        - cron: '0 * * * *' #  run every hour
    
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
    
          - name: checkout repo content
            uses: actions/checkout@v3
    
          - name: setup python with pip cache
            uses: actions/setup-python@v4
            with:
              python-version: '3.9' 
              cache: 'pip' # caching pip dependencies
    
          - name: install any new dependencies
            run: pip install -r requirements.txt
              
          - name: execute py script 
            run: python badge_generator.py 
            
          - name: commit files
            run: |
              git config --local user.email "action@github.com"
              git config --local user.name "GitHub Action"
              git add -A
              timestamp=$(date -u)
              git diff-index --quiet HEAD || (git commit -a -m "Last badge update : ${timestamp}" --allow-empty)
              
          - name: push changes
            uses: ad-m/github-push-action@master
            with:
              github_token: ${{ secrets.GITHUB_TOKEN }}
              branch: main 
    

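    The Python script will run every hour and update the badge. The cron expression can be changed to run more often, but GitHub runs scheduled workflows at most once every 5 minutes, so the badge is refreshed periodically rather than in real time.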

    With the default arguments of create_badge, the result is an orange shields.io badge labelled "Downloads" that shows the current count.

    Your repository layout will look like this:

    .github/
    └─ workflows/
       └─ badge.yml
    badge_generator.py
    requirements.txt
    README.md
    ... your other files
    

    Other methods

    If you don't want the script to modify your README directly, you will have to implement some sort of API that serves the download count. Look into free serverless functions on Vercel or a REST API on Render. Either could be paired with a dynamic badge, for example through the Dynamic Badges GitHub Action.
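
As one possible shape for such an API: shields.io offers an endpoint badge that renders whatever JSON a URL returns, so the serverless function only needs to serve a small JSON document. A minimal sketch, assuming the shields.io endpoint schema (schemaVersion, label, message, color). The helper endpoint_payload is hypothetical, and fetching the count and hosting the function are left out:

```python
import json


def endpoint_payload(download_count: str, label: str = 'Downloads',
                     color: str = 'orange') -> str:
    """Serialize a badge description in the shields.io endpoint format."""
    payload = {
        'schemaVersion': 1,  # required by the endpoint schema
        'label': label,
        'message': str(download_count),
        'color': color,
    }
    return json.dumps(payload)


if __name__ == '__main__':
    print(endpoint_payload('602'))
```

The README would then embed a badge of the form https://img.shields.io/endpoint?url=<your-api-url>, where <your-api-url> is a placeholder for wherever the function is deployed.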