Getting emails from Google API very slow - multithreading needed?


I'm building an app that needs to fetch every email in a user's Gmail account, which can mean more than 100,000 messages. For those who don't know, getting information about each email starts with the list API call, which only returns email ids. That part is fine: in my testing, getting the ids of 200,000 emails takes ~90 seconds using the Node.js Google API library. But to extract information from each email, you must pass its id to the get API call, and with the same library this step is very slow.

I've used the Batchelor library to make batch API calls with a batch size of 10, and I've also used partial responses, requesting only the snippet field. Even with these measures the program only gets information for ~1000 emails in 30 seconds, and the time taken by each batch call is inconsistent. Here is my code:

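// `batch` is assumed to be a Batchelor instance created beforehand
// (batch endpoint and OAuth bearer token configured elsewhere)
async function getEmails(){
    const batchSize = 10
    const ids = []         // List of email ids, filled beforehand

    for(let i = 0; i < ids.length; i++){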
        batch.add({
            'method': 'GET',
            'path': '/gmail/v1/users/me/messages/' + ids[i] + '?fields=snippet'  // Request partial response
        })

        // Run in batches of size batchSize
        if( (i + 1) % batchSize == 0 || i + 1 == ids.length){
            try {
                await runBatch()
            }
            catch (err) { console.log('Error batching: ' + err.toString()) } 
        }
    }
}

function runBatch(){
    return new Promise((resolve, reject) => {
        batch.run(function(err, response){
            // Reset before the next batch call, even after an error;
            // otherwise the failed requests stay queued and are sent again
            batch.reset();
            if(err){
                reject(err);
            }
            else {
                // Do something with response
                resolve();
            }
        })
    })
}


Is there something I'm doing wrong? Should I be using a different Google API library? Or is this a limitation of Node.js being single-threaded? In that case, would it be better to use a different backend language such as Python or Java for something like this? Thanks.


Solution

  • This is a free API, so you are confined by the limits that Google places on it.

    You're going to be throttled if you go too fast, and batching isn't going to help with that. Each request inside a batch still counts against your quota, so all batching does is save you the extra HTTP calls; it's not going to get you the information any faster.
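
Since the limit is the quota rather than Node.js, about the best you can do is stay just under it and back off when you get throttled. Here is a rough sketch of that idea, not a tested Gmail client. `fetchOne` is a hypothetical function standing in for whatever performs the real get call for one id, the `err.code === 429` check is an assumption about how your client surfaces rate-limit errors, and the default of 40 requests per second is my own guess at a safe pace (as I understand the published limits, it is roughly 250 quota units per user per second with a get costing 5 units, but check the current usage-limits page).

```javascript
// Sketch only: fetchOne, the error shape and all numeric defaults are assumptions.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumes the client reports rate limiting as an error with code 429.
function isRateLimitError(err) {
    return Boolean(err) && err.code === 429;
}

// Run fn(); on a rate-limit error wait baseDelayMs * 2^attempt plus jitter, then retry.
// Any other error, or running out of retries, is rethrown to the caller.
async function withBackoff(fn, { maxRetries = 5, baseDelayMs = 500 } = {}) {
    for (let attempt = 0; ; attempt++) {
        try {
            return await fn();
        } catch (err) {
            if (!isRateLimitError(err) || attempt >= maxRetries) throw err;
            const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
            await sleep(delay);
        }
    }
}

// Start at most `perSecond` requests per window (windowMs, one second by default).
// Requests inside a window run concurrently; results keep the order of `ids`.
async function fetchAllThrottled(ids, fetchOne, { perSecond = 40, windowMs = 1000, ...backoff } = {}) {
    const results = [];
    for (let i = 0; i < ids.length; i += perSecond) {
        const windowStart = Date.now();
        const chunk = ids.slice(i, i + perSecond);
        const chunkResults = await Promise.all(
            chunk.map((id) => withBackoff(() => fetchOne(id), backoff))
        );
        results.push(...chunkResults);

        // Wait out the rest of the window before starting the next chunk
        const elapsed = Date.now() - windowStart;
        if (i + perSecond < ids.length && elapsed < windowMs) {
            await sleep(windowMs - elapsed);
        }
    }
    return results;
}
```

You would call it as `await fetchAllThrottled(ids, getSnippet)`, where `getSnippet` is your own wrapper around the get call. If the quota figures above hold, the ceiling is about 50 gets per second, so 200,000 emails need at least 200,000 / 50 = 4,000 seconds, a bit over an hour, no matter which language or how many threads you use.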