I have a for-loop in a program I am running with Node.js. The function is x() from the x-ray package, and I am using it to scrape data from a webpage and then write that data to a file. The program works when used to scrape ~100 pages, but I need to scrape ~10000 pages. When I try to scrape that many pages, the files are created but they do not hold any data. I believe this happens because the for-loop does not wait for x() to return the data before moving on to the next iteration.
Is there a way to make Node wait for the x() function to complete before moving on to the next iteration?
//takes in file of urls, 1 on each line, and splits them into an array.
//Then scrapes webpages and writes content to a file named for the pmid number that represents the study
//split urls into arrays
var fs = require('fs');
var array = fs.readFileSync('Desktop/formatted_urls.txt').toString().split("\n");
var Xray = require('x-ray');
var x = new Xray();
for(i in array){
    //get unique number and url from the array to be put into the text file name
    number = array[i].substring(35);
    url = array[i];
    //use .write function of x from xray to write the info to a file
    x(url, 'css selectors').write('filepath' + number + '.txt');
}
Note: Some of the pages I am scraping do not return any value.
The problem with your code is that it never waits for the files to finish being written to the file system. Rather than processing the URLs strictly one at a time, you can start all of the downloads up front and then wait until every one of them has completed.
One of the recommended libraries for dealing with promises in Node.js is bluebird. The sample here only uses new Promise() and Promise.all(), which Node's built-in Promise also provides.
http://bluebirdjs.com/docs/getting-started.html
In the updated sample (see below), we iterate over all of the URLs, start each download, and keep the promise created for each one. Each promise is resolved once its file has been written. Finally, we wait for all of the promises to resolve using Promise.all().
Here's the updated code:
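var promises = [];
// Wrap one scrape-and-write in a promise that settles when the write stream is done.
var getDownloadPromise = function(url, number){
    return new Promise(function(resolve, reject){
        x(url, 'css selectors').write('filepath' + number + '.txt')
            .on('finish', function(){
                console.log('Completed ' + url);
                resolve();
            })
            // Reject if the stream reports an error, so the promise cannot hang forever.
            .on('error', reject);
    });
};
// Use an indexed loop with local variables instead of for...in with implicit globals.
for(var i = 0; i < array.length; i++){
    var url = array[i];
    if(!url) continue; // skip blank lines, such as the one after a trailing newline
    var number = url.substring(35);
    promises.push(getDownloadPromise(url, number));
}
Promise.all(promises).then(function(){
    console.log('All urls have been completed');
}).catch(function(err){
    // Promise.all rejects as soon as any single download fails.
    console.error('A download failed: ' + err);
});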
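With ~10000 URLs, starting every request at once may be more than your machine or the target site will tolerate. If that turns out to be a problem, here is a minimal sketch of capping how many downloads run at the same time. It is only an illustration: runWithLimit is a hypothetical helper name, not part of x-ray or bluebird, and task can be any function that returns a promise, such as getDownloadPromise above. Failures are recorded per item instead of aborting the whole run, since some pages may return nothing.

```javascript
// Hypothetical helper: run task(item, index) for every item, with at most
// `limit` tasks in flight at once. `limit` is assumed to be at least 1.
// Resolves to an array of { ok, value } or { ok, error }, in input order.
async function runWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next item that no worker has claimed yet

  // Each worker keeps claiming the next item until none are left.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      try {
        const value = await task(items[index], index);
        results[index] = { ok: true, value: value };
      } catch (error) {
        // Record the failure and carry on with the remaining items.
        results[index] = { ok: false, error: error };
      }
    }
  }

  // Start up to `limit` workers and wait until all of them run out of items.
  const workers = [];
  for (let w = 0; w < Math.min(limit, items.length); w++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Possible usage with the code above (the limit of 5 is an arbitrary choice):
// runWithLimit(array, 5, function (url) {
//   return getDownloadPromise(url, url.substring(35));
// }).then(function (results) {
//   console.log('All urls have been attempted');
// });
```

Setting the limit to 1 gives the strictly one-at-a-time behaviour the question asks about. A larger limit trades some load on the target site for speed.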