javascript, node.js, es6-promise, request-promise

How to use multiple promises in recursion?


I am trying to write a script that visits a website, takes the first 10 links from it, then visits each of those 10 links and takes the next 10 links found on each of those pages, and so on, until 1000 pages have been visited. I tried to do this with a for loop inside a promise plus recursion. This is my code:

const rp = require('request-promise');
const url = 'http://somewebsite.com/';

const websites = []
const promises = []

const getOnSite = (url, count = 0) => {
    console.log(count, websites.length)
    promises.push(new Promise((resolve, reject) => {
        rp(url)
            .then(async function (html) {
                let links = html.match(/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g)
                if (links !== null) {
                    links = links.splice(0, 10)
                }
                websites.push({
                    url,
                    links
                })
                if (links !== null) {
                    for (let i = 0; i < links.length; i++) {
                        if (count < 3) {
                            resolve(getOnSite(links[i], count + 1))
                        } else {
                            resolve()
                        }
                    }
                } else {
                    resolve()
                }

            }).catch(err => {
                resolve()
            })
    }))

}

getOnSite(url)

Solution

  • I think you might want a recursive function that takes three arguments:

    1. an array of urls to extract links from
    2. an array of the accumulated links
    3. a limit for when to stop crawling

    You'd kick it off by calling it with just the root url and awaiting the returned promise:

    const allLinks = await crawl([rootUrl]);
    

    On the initial call the second and third arguments could assume default values:

    async function crawl (urls, accumulated = [], limit = 1000) {
      ...
    }
    

    The function would fetch each url, extract its links, and recurse until it hit the limit. I haven't tested any of this, but I'm thinking something along these lines:

    // limit the number of links per page to 10
    const perPageLimit = 10;
    
    async function crawl (urls, accumulated = [], limit = 1000) {
    
      // if limit has been depleted or if we don't have any urls,
      // return the accumulated result
      if (limit <= 0 || urls.length === 0) {
        return accumulated;
      }
    
      // fetch this batch of urls and extract the links from each page
      const links = (await Promise.all(
        urls
          .slice(0, perPageLimit)      // take at most 10 urls per batch
          .map(url => fetchHtml(url)   // fetch the url
            .then(extractUrls))        // and extract its links
      )).flat();                       // flatten the per-page link arrays
    
      // then recurse
      return crawl(
        links, // newly extracted array of links from this call
        [...accumulated, ...links], // appended to the accumulated list
        limit - links.length // reduce the limit and recurse
      );
    }
    
    async function fetchHtml (url) {
      // fetch the url and resolve with the page html
    }
    
    const extractUrls = (html) => html.match( ... )
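    

    For completeness, here is one way the two stubs could be filled in, again untested: fetchHtml just reuses request-promise from the question (returning an empty string on a failed request is my own choice, not a requirement), extractUrls reuses the question's link regex, and the kick-off is wrapped in an async IIFE because top-level await isn't available in a plain CommonJS script.

    const rp = require('request-promise');

    async function fetchHtml (url) {
      // request-promise resolves with the response body (the html) by default;
      // fall back to an empty string on errors so one bad page doesn't
      // reject the whole Promise.all batch
      return rp(url).catch(() => '');
    }

    // extract absolute links from the html (same regex as in the question),
    // returning an empty array when nothing matches
    const extractUrls = (html) =>
      html.match(/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g) || [];

    // kick everything off
    (async () => {
      const allLinks = await crawl(['http://somewebsite.com/']);
      console.log('collected', allLinks.length, 'links');
    })();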