Tags: javascript, node.js, web-scraping, request, cheerio

How to avoid Error 403 while web scraping with cheerio


I'm web scraping a website and I have an array of links:

 'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Abercorn',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Longueuil',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Sainte-Anne-De-Bellevue',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Shawinigan',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Chateauguay',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Mont-Laurier',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Georges',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Sherbrooke',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Chicoutimi',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Montreal',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Henri-De-Levis',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Stukely-Sud',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Drummondville',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Montreal-Est',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Hubert',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Trois-Rivieres',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Gatineau',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Montreal-Nord',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Jerome',
  "http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Val-D'or",
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Granby',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Montreal-Ouest',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Lambert',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Verdun',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Lachine',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Quebec',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Laurent',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Warwick',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Lasalle',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Rigaud',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Leonard',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Westmount',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Laval',
  'http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Roxboro'

But when I make the requests, some of those links return a 403 Forbidden error:

Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Verdun
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Granby
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Lambert
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Val-D'or
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Lasalle
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Laval
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Warwick
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Quebec
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Westmount
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Roxboro
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Laurent
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Rigaud
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Saint-Leonard
Error null Forbidden http://www.adventistdirectory.org/SearchResults.aspx?CtryCode=CA&StateProv=QC&City=Lachine

When I use a list with fewer links it works perfectly.

Here is my code:

const request = require('request');
const cheerio = require('cheerio');

function readChurches(cities){
    const churches = []
    for (let index = 0; index < cities[0].length; index++){
        const city = cities[0][index];
        churches.push(new Promise((resolve, reject) => {
            const church = []
            let options = {
                url: city,
                headers: {
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
                }
            };

            request(options, (error, response, html) => {
                if(!error && response.statusCode == 200) {
                    const $ = cheerio.load(html);
                    const $$ = cheerio.load($('table').find('tbody').eq(1).find('tr').eq(1).find('td').eq(1).html())

                    $$('a').each((i, el) => {
                        const item = $(el).attr('href')
                        if(item != undefined){
                            if(item.includes('ViewEntity')) {
                                church.push(`http://www.adventistdirectory.org${item}`);
                            }
                        }
                    });
                    resolve(church);
                } else {
                    console.log('Error',error,response.statusMessage,city)
                    reject(error)
                }
            });
        }))
    }

    return Promise.all(churches);
}

What can be done to avoid the 403 error? When I open the links in my browser they work, but when I request them from the JavaScript function they fail.

--- NEW UPDATES ---

I've changed the code and added a try/catch block:

function readChurches(cities){
    const churches = []
    for (let index = 0; index < cities[0].length; index++){
        const city = cities[0][index];
        churches.push(new Promise((resolve, reject) => {
            const church = [] 
            let options = {
                url: city,
                headers: {
                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
                }
            };

            try {
                request(options, (error, response, html) => {
                    if(!error && response.statusCode == 200) {
                        const $ = cheerio.load(html);
                        const $$ = cheerio.load($('table').find('tbody').eq(1).find('tr').eq(1).find('td').eq(1).html())
        
                        $$('a').each((i, el) => {
                            const item = $(el).attr('href')
                            if(item != undefined){
                                if(item.includes('ViewEntity')) {
                                    church.push(`http://www.adventistdirectory.org${item}`);
                                }    
                            }
                        });
                        resolve(church);
                    } 
                });
                
            } catch (error) {
                console.log('Error',error,city)
                reject(error)
            }
        }))
    }

    return churches
}

I also created this function, provided by @chrispytoes:

async function doStuff(churches) {
    const results = [];
    for(let i in churches) {
        try {
            console.log(churches[i])
            results.push(await churches[i]);
            sleep(5000);    
        } catch (error) {
            console.log(error)
        }
    }

    return results
}

And I'm running it like this:

async function run(){
    try {
        let provinces = []
        provinces.push(`http://www.adventistdirectory.org/BrowseStateProv.aspx?CtryCode=CA&StateProv=QC`)
        
        let cities = await readCities(provinces);
    
        const churches = await readChurches(cities);
        const stuff = await doStuff(churches)
        
        console.log('Churches: ', stuff);
    
        console.log('End')
        
    } catch (error) {
        console.log('Error', error)
    }
}

And I'm getting this on my console:

Promise { <pending> }
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=19653'
  ]
}
Promise { <pending> }
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=19637'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=54633'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=31155'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=15271'
  ]
}
Promise { <pending> }
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=30783'
  ]
}
Promise { <pending> }
Promise { <pending> }
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=15265'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=15255'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=15251'
  ]
}
Promise { <pending> }
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=15247'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=19645'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=32838'
  ]
}
Promise {
  [
    'http://www.adventistdirectory.org/ViewEntity.aspx?EntityID=29973'
  ]
}
Promise { <pending> }

It never reaches console.log('Churches: ', stuff);


Solution

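  • The 403 response is a decision made by the server you are requesting from, so there is no universal way to "avoid" it. You are probably sending the requests too quickly, and the server is blocking you for generating too much traffic.

    Each request starts as soon as its Promise is constructed, so building the whole array of promises up front fires all the requests at once; Promise.all only waits for them to finish. To send them one at a time, create each request inside a loop and await it before starting the next one.

    So something like this may work, where readChurch(city) stands for a function (not shown in the question) that performs a single request and returns a promise:

    // Resolves after `time` milliseconds.
    const wait = (time) =>
      new Promise((res) => setTimeout(res, time));
    
    async function doStuff(cities) {
      const results = [];
      for (const city of cities) {
        await wait(1000);
        // The request is created here, so only one is in flight at a time.
        const result = await readChurch(city);
        console.log(result);
        results.push(result);
      }
      return results;
    }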
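
A sequential loop like this only makes progress if every request promise eventually settles. In the updated code from the question, nothing calls resolve or reject when the status is not 200, and the try/catch cannot see errors raised inside the asynchronous callback, so a single 403 leaves that promise pending and the await never returns. The following is a minimal sketch of one way to handle that, assuming the (error, response, body) callback signature of the request package as used in the question. The names fetchHtml, fetchSequentially and the requestFn parameter are illustrative, not part of any library.

```javascript
// Resolves after `ms` milliseconds.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wraps one callback-style request in a promise that always settles.
// Assumption: `requestFn` has the same shape as the `request` package,
// i.e. requestFn(options, (error, response, body) => ...).
function fetchHtml(requestFn, options) {
  return new Promise((resolve, reject) => {
    requestFn(options, (error, response, body) => {
      if (error) return reject(error);
      if (response.statusCode !== 200) {
        // Reject on 403 and any other non-200 status instead of staying pending.
        return reject(new Error(`HTTP ${response.statusCode} for ${options.url}`));
      }
      resolve(body);
    });
  });
}

// Fetches the URLs one at a time, pausing `delayMs` between requests.
// Failures are recorded per URL instead of aborting the whole run.
async function fetchSequentially(requestFn, urls, delayMs = 1000) {
  const results = [];
  for (const [index, url] of urls.entries()) {
    if (index > 0) await wait(delayMs); // no pause before the first request
    try {
      results.push({ url, html: await fetchHtml(requestFn, { url }) });
    } catch (error) {
      results.push({ url, error: error.message });
    }
  }
  return results;
}
```

With the real library this might be called as fetchSequentially(require('request'), cities[0]), after which each html string can be passed to cheerio.load as in the question. Extra options such as the User-Agent header would need to be added to the object passed to fetchHtml. Whether a one-second pause is enough depends on the server, so the delay is a value to tune rather than a guarantee.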