
How to programmatically download many large files from Dropbox


The National Speech Corpus is a Natural Language Processing corpus of Singaporeans speaking English, which can be found here: https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus.

When you sign up for the free corpus, you are directed to a Dropbox folder. The corpus is 1 TB and (as of this writing) has four parts. I only wanted to download PART 1, but even this part has 1446 zip files that are each quite large. My question is: how do I programmatically download many large files from Dropbox onto a Linux (Ubuntu 16.04) VM using only the command line?

The directory tree for the relevant part looks like:

root
|-LEXICON
|-PART1
  |-DATA
    |-CHANNEL0
      |-WAVE
        |-SPEAKER0001.zip
        |-SPEAKER0002.zip
        ...
        |-SPEAKER1446.zip

I looked into a few different approaches:

  1. Downloading the WAVE parent directory using a shared link via the wget command as described in this question. However, this didn't work as I received this error:

    Reusing existing connection to www.dropbox.com:443
    HTTP request sent, awaiting response... 400 Bad Request
    2021-01-06 23:09:06 ERROR 400: Bad Request.

I assumed this was because the WAVE directory was too large for Dropbox to zip.

  2. Based on this post, I could download the HTML of the WAVE parent directory and extract the direct links to the individual zip files. However, the direct links to the individual files were not in the HTML file.

  3. Based on the same post as in (2), I could also create a shared link for each zip file using the Dropbox API, though this seemed too cumbersome.

  4. Download the Linux Dropbox client and sync the relevant files, as outlined on this installation page.

In the end, the 4th option did work for me, but I wanted to post this investigation for anyone who needs to download this dataset in the future. Also, I wanted to see if anyone else had better approaches.


Solution

  • As I described, the approach that worked for me was to use Dropbox's Linux client to sync the files onto my Linux VM. You can follow these instructions to download the Linux client. They worked for me on my Ubuntu 16.04 VM.

    One issue I encountered with the sync client was how to selectively exclude directories. I only had 630 GB on my VM and the entire National Speech Corpus is 1 TB, so I needed to exclude directories before the Dropbox sync filled up my disk.
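
A quick way to confirm the mismatch before starting the sync is to compare free disk space against the size you intend to download. This is only a minimal sketch using the Python standard library; the function name `fits_on_disk` and the reserve margin are illustrative and not part of any Dropbox tooling.

```python
import os
import shutil


def fits_on_disk(path, needed_bytes, reserve_bytes=0):
    """Return True if the filesystem holding `path` has at least
    needed_bytes + reserve_bytes of free space."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes + reserve_bytes


if __name__ == "__main__":
    corpus_bytes = 10**12  # roughly 1 TB, the full corpus size quoted above
    home = os.path.expanduser("~")  # the sync folder lives under the home directory
    print("Full corpus fits:", fits_on_disk(home, corpus_bytes))
```

If this reports False, some directories need to be excluded before the client is allowed to sync.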

    You can selectively exclude directories using the dropbox.py Python script that is linked at the bottom of the installation page. A link to the script is here. Calling the script from my home directory (where the Dropbox sync folder is created by default) worked using the command:

    python dropbox.py exclude add ~/Dropbox/<path_to_excluded_dir>
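
When many sibling directories have to be excluded, the list of paths can be generated rather than typed by hand. The sketch below assumes the top-level folders have already appeared locally under the sync folder (the client creates them as it starts syncing) and that you want to keep exactly one subtree. The helper name `dirs_to_exclude` is hypothetical. It only prints the commands so that you can review them before running anything.

```python
import os


def dirs_to_exclude(sync_root, keep_rel):
    """Walk down from sync_root towards keep_rel and collect every
    sibling directory that is not on the path to keep_rel."""
    excluded = []
    current = sync_root
    for part in os.path.normpath(keep_rel).split(os.sep):
        if not os.path.isdir(current):
            break  # this level has not been synced locally (yet)
        for name in sorted(os.listdir(current)):
            full = os.path.join(current, name)
            if os.path.isdir(full) and name != part:
                excluded.append(full)
        current = os.path.join(current, part)
    return excluded


if __name__ == "__main__":
    root = os.path.expanduser("~/Dropbox")
    # Keep only the WAVE directory from the tree shown in the question.
    for path in dirs_to_exclude(root, "PART1/DATA/CHANNEL0/WAVE"):
        print("python dropbox.py exclude add", path)
```

Note that this treats every sibling as unwanted, including LEXICON, so drop any printed line for a directory you want to keep.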
    

    You may need to start or stop the Dropbox client, which can be done with:

    python dropbox.py start
    python dropbox.py stop
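
For a long download it can help to poll the client until it reports that syncing has finished. The following is a rough sketch, not a tested recipe: it assumes dropbox.py sits in the current directory, that its `status` command is available, and that it prints text containing "Up to date" once the sync is done. Check the actual status text on your machine and adjust `done_text` if it differs.

```python
import subprocess
import time


def dropbox_status(script="dropbox.py"):
    """Return whatever `python dropbox.py status` prints."""
    result = subprocess.run(
        ["python", script, "status"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def wait_until_synced(get_status=dropbox_status, done_text="Up to date",
                      poll_seconds=60, max_polls=None):
    """Poll get_status() until done_text appears in its output.

    Returns True once synced, or False if max_polls is reached first."""
    polls = 0
    while True:
        if done_text in get_status():
            return True
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(poll_seconds)
```

Calling `wait_until_synced()` with the defaults checks once a minute and blocks until the client is idle. Passing `max_polls` gives it an upper bound.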
    

    Finally, see the script's built-in help for more information on the available commands:

    python dropbox.py --help
    

    With this approach, I was able to easily download the desired files without overwhelming my VM.
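
Once the sync finishes, it is worth checking that every expected archive actually arrived. The sketch below assumes the files are numbered contiguously from SPEAKER0001.zip to SPEAKER1446.zip, which the directory listing in the question suggests but does not show in full. The helper name and the local path are illustrative.

```python
import os


def missing_speaker_zips(wave_dir, first=1, last=1446):
    """Return the expected SPEAKERnnnn.zip names that are absent from wave_dir.

    Assumes contiguous numbering between `first` and `last`."""
    present = set(os.listdir(wave_dir))
    expected = ["SPEAKER{:04d}.zip".format(i) for i in range(first, last + 1)]
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    wave = os.path.expanduser("~/Dropbox/PART1/DATA/CHANNEL0/WAVE")
    if os.path.isdir(wave):
        missing = missing_speaker_zips(wave)
        print(len(missing), "archives missing")
```

An empty result only means the file names are present. It does not verify that each zip finished downloading intact.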