Tags: javascript, html, web-scraping, web-worker

Is web scraping possible without Node.js?


I currently have a simple webpage that consists only of a .js, a .css and an .html file. I do not want to use Node.js.

Given these constraints, is it possible to search the content of external webpages using JavaScript (e.g. by running a web worker in the background)?

For example, I would like to:

Get the first URL from a Google image search.

Edit:

I have now tried it and it worked fine. However, after two weeks I now get this error:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at .... (Reason: CORS header ‘Access-Control-Allow-Origin’ missing).

Any ideas how to solve this?

Firefox documents the error here: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS/Errors/CORSMissingAllowOrigin


Solution

  • Yes, this is possible. Just use the XMLHttpRequest API:

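    var request = new XMLHttpRequest();
    // The third argument must be true (asynchronous): responseType cannot be set on synchronous requests from a page.
    request.open("GET", "https://bypasscors.herokuapp.com/api/?url=" + encodeURIComponent("https://duckduckgo.com/html/?q=stack+overflow"), true);
    request.responseType = "document";  // parse the response as an HTML document
    request.onload = function () {
      // onload only fires once the request has completed, so checking the status is enough
      if (request.status === 200) {
        var a = request.responseXML && request.responseXML.querySelector("div.result:nth-child(1) > div:nth-child(1) > h2:nth-child(1) > a:nth-child(1)");
        if (a) {
          console.log(a.href);
          document.body.appendChild(a);  // moves the link into the current page
        } else {
          console.error("No matching link found; the page layout may have changed.");
        }
      } else {
        console.error(request.status, request.statusText);
      }
    };
    request.onerror = function () {
      // Network and CORS failures end up here; status is 0 in that case
      console.error("Request failed", request.status, request.statusText);
    };
    request.send(null);  // GET request, so there is no body to send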

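    Note that I had to use a proxy to work around CORS: the browser blocks a page from reading a cross-origin response unless the remote server sends an Access-Control-Allow-Origin header that permits it. If you want to do this, run your own proxy on your own server.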
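
The same request can also be sketched with `fetch` and `DOMParser` instead of `XMLHttpRequest`. This is only a minimal sketch under assumptions: `PROXY_BASE` is a placeholder for a proxy you run yourself (the host `proxy.example.com` is hypothetical), and the helper names `buildProxyUrl`, `firstLinkHref` and `fetchFirstResult` are made up for illustration. `DOMParser` is not available inside a web worker, so the parsing step is assumed to run on the page itself.

```javascript
// Hypothetical proxy endpoint (assumption); replace it with a proxy you control.
const PROXY_BASE = "https://proxy.example.com/api/?url=";

// Build the proxied URL; the target must be URL-encoded so that its own
// query string is not mistaken for parameters of the proxy.
function buildProxyUrl(proxyBase, targetUrl) {
  return proxyBase + encodeURIComponent(targetUrl);
}

// Return the href attribute of the first element matching the selector,
// or null if nothing matches (for example after a layout change).
function firstLinkHref(doc, selector) {
  const a = doc.querySelector(selector);
  return a ? a.getAttribute("href") : null;
}

// Fetch the page through the proxy, parse it as HTML, and extract the first link.
async function fetchFirstResult(targetUrl, selector) {
  const response = await fetch(buildProxyUrl(PROXY_BASE, targetUrl));
  if (!response.ok) {
    throw new Error(response.status + " " + response.statusText);
  }
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, "text/html");
  return firstLinkHref(doc, selector);
}
```

A call might look like `fetchFirstResult("https://duckduckgo.com/html/?q=stack+overflow", "div.result h2 a").then(console.log).catch(console.error)`. The selector here is a simplified guess based on the one in the answer and may need adjusting to the actual page markup.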