
Apify web scraper ignoring URL Fragment


I have a list of URLs that I want to scrape, so I put them into startUrls like this:

"startUrls": [
    {
      "url": "https://www.example.com/sample#000000",
      "method": "GET"
    },
    {
      "url": "https://www.example.com/sample#111111",
      "method": "GET"
    }
  ]

And this is an excerpt from my pageFunction code:

async function pageFunction(context) {
  const { request } = context;
  var name;
  try {
     name = document.querySelector('h1').textContent;
  } catch (e) {
     name = "null";
  }
  return {
     link: request.url,
     name
  };
}

It works fine with URLs that differ in the domain or the path. But if the only difference is in the fragment, only the first URL is processed; the second is considered a duplicate and skipped.

I've tried adding this bit of code at the second line of the pageFunction:

await context.enqueueRequest({
  url: context.request.url,
  keepUrlFragment: true,
});

But that leads to another problem: it produces duplicate results for each URL.

So what should I do to make this work correctly? Is there another way to set keepUrlFragment to true besides calling enqueueRequest?


Solution

  • Unfortunately, you currently cannot set keepUrlFragment directly in startUrls, so I propose not using them for the real URLs at all. Instead, pass the URLs as an array in customData, use a single dummy start URL such as http://example.com with the label START, and then use a pageFunction like this:

    async function pageFunction(context) {
      const { request, customData } = context;
      if (request.userData.label === 'START') {
        // The dummy start URL only fans out the real URLs,
        // each enqueued with its fragment preserved.
        for (const url of customData) {
          await context.enqueueRequest({
            url,
            keepUrlFragment: true,
          });
        }
      } else {
        // Your main scraping logic here
      }
    }
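
    For reference, here is a minimal sketch of the corresponding actor input (the dummy URL, the START label, and the exact field layout are illustrative, assuming the Web Scraper's startUrls accept a userData object):

    {
      "startUrls": [
        {
          "url": "http://example.com",
          "method": "GET",
          "userData": { "label": "START" }
        }
      ],
      "customData": [
        "https://www.example.com/sample#000000",
        "https://www.example.com/sample#111111"
      ]
    }

    The else branch is where the original scraping logic goes (for example, the h1 extraction from the question); every request enqueued from customData reaches it with its fragment intact, so each fragment URL produces exactly one result.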