I'm just starting off with Node.js and struggling with some of the finer points of non-blocking (asynchronous?) code. I know there are lots of questions about blocking vs non-blocking code already, but after reading through some of them, I still couldn't sort out this issue.
As a learning exercise, I made a simple script that loads URLs from a file, queries them using the request
module, and notifies me if a URL is the New York Times homepage.
Here is a MWE:
// CSV Parse test
'use strict';
var request = require('request');
var fs = require('fs');
var parse = require('csv-parse');
var text = fs.readFileSync('input.txt','utf8');
var r_isnyt = /New York Times/;
var data = [];
parse(text, {skip_empty_lines: true}, function(err, data){
for (var r = 0; r < data.length; r++) {
console.log ('Logging from within parse function:');
console.log ('URL: '+data[r][0]+'\n');
var url = data[r][0];
request(url, function(error, response, body) {
console.log ('Logging from within request function:');
console.log('Loading URL: '+url+'\n');
if (!error && response.statusCode == 200) {
if (r_isnyt.exec(body)){
console.log('This is the NYT site! ');
}
console.log ('');
}
});
}
});
And here is my input.txt
:
http://www.nytimes.com/
www.google.com
From what I understood of non-blocking code, this program's flow would be:
parse(text, {skip_empty_lines: true}, function(err, data){
loads the data and returns the lines of the input file in a 2D array, which is complete and available right after that line.
For
Loop iterates through it, loading URLs with the line request(url, function(error, response, body) {
, which is non-blocking (right?), so the For loop continues without waiting for the previous URL to finish loading.
As a result, you could have multiple URLs being loaded at once, and the console.log
calls within request
will print in the order the responses are received, not the order of the input file.
Within request
, which has access to the results of the request to url
, we print the URL, check whether it's the New York Times, and print the result of that check (all blocking steps I thought).
That's a long-winded way of getting around to my question. I just wanted to clarify that I thought I understood the basic concepts of non-blocking code. So what's baffling me is that my output is as follows:
>node parsecsv.js
Logging from within parse function:
URL: http://www.nytimes.com/
Logging from within parse function:
URL: www.google.com
Logging from within request function:
Loading URL: www.google.com
Logging from within request function:
Loading URL: www.google.com
This is the NYT site!
>
I understand why the request
printouts all happen together at the end, but why do they both print Google, and much more baffling, why does the last one say it's the NYT site, when the log line right before it (from within the same request
call) has just printed Google? It's like the request
calls are getting the correct URLs, but the console.log
calls are lagging, and just print everything at the end with the ending values.
Interestingly, if I reverse the order of the URLs, everything looks correct in the output, I guess because of differences in response times from the sites:
node parsecsv.js
Logging from within parse function:
URL: www.google.com
Logging from within request function:
Loading URL: www.google.com
Logging from within parse function:
URL: http://www.nytimes.com/
Logging from within request function:
Loading URL: http://www.nytimes.com/
This is the NYT site!
>
Thanks in advance.
Update
Based on the answer from jfriend00 below, I've changed my code to use a .forEach
loop instead as follows. This appears to fix the issue.
// CSV Parse test
'use strict';
var request = require('request');
var fs = require('fs');
var parse = require('csv-parse');
var text = fs.readFileSync('input.txt','utf8');
var r_isnyt = /New York Times/;
var data = [];
parse(text, {skip_empty_lines: true}, function(err, data){
data.forEach( function(row) {
console.log ('Logging from within parse function:');
console.log ('URL: '+row[0]+'\n');
let url = row[0];
request(url, function(error, response, body) {
console.log ('Logging from within request function:');
console.log('Loading URL: '+url+'\n');
if (!error && response.statusCode == 200) {
if (r_isnyt.exec(body)){
console.log('This is the NYT site! ');
}
console.log ('');
}
});
});
});
I understand why the request printouts all happen together at the end, but why do they both print Google, and much more baffling, why does the last one say it's the NYT site, when the log line right before it (from within the same request call) has just printed Google? It's like the request calls are getting the correct URLs, but the console.log calls are lagging, and just print everything at the end with the ending values.
You correctly understand that the for
loop initiates all the request()
calls and then they finish sometime later in whatever order the responses come back in.
But, your logging statement:
console.log('Loading URL: '+url+'\n');
refers to a variable in your for
loop which is shared by all the iterations of your for
loop. So, since the for
loop runs to completion and THEN sometime later all the responses arrive and get processed, your for
loop will have finished by the time any of the responses get processed and thus the variable url
will have whatever value it has in it when the for
loop finishes which will be the value from the last iteration of the for
loop.
In ES6, you can define the variable with let
instead of var
and it will be block scopes so there will be a unique variable url
for each iteration of the loop.
So, change:
var url = data[r][0];
to
let url = data[r][0];
Prior to ES6, a common way to avoid this issue is to use .forEach()
to iterate since it takes a callback function so all your loop code is in its own scope by nature of how .forEach()
works and thus each iteration has its own local variables rather than shared local variables.
FYI, though let
solves this issue and is one of the things it was designed for, I think your code would probably be a bit cleaner if you just used .forEach()
for your iteration since it would replace multiple references to data[r]
with a single reference to the current array iteration value.
parse(text, {skip_empty_lines: true}, function(err, data){
data.forEach( function(row) {
console.log ('Logging from within parse function:');
console.log ('URL: '+row[0]+'\n');
let url = row[0];
request(url, function(error, response, body) {
console.log ('Logging from within request function:');
console.log('Loading URL: '+url+'\n');
if (!error && response.statusCode == 200) {
if (r_isnyt.exec(body)){
console.log('This is the NYT site! ');
}
console.log ('');
}
});
});
});