Tags: javascript, node.js, web-scraping, axios, cheerio

404 response while using axios.get on a live-server


I'm learning web-scraping with JavaScript, and while trying to log a simple web page to the console, I'm getting a weird 404 error:

Failed to load resource: the server responded with a status of 404 (Not Found)
Refused to execute script from '[...]script' because its MIME type ('text/html') is not executable, and strict MIME type checking is enabled.

I suspect the second error is just a side effect of the await axios.get(url) call failing.

My code:

import { load } from "cheerio";
import axios from "axios";

const testGet = async function (url) {
  try {
    const response = await axios.get(url);
    const html = response.data;
    const $ = load(html);
    console.log($.html());
  } catch (error) {
    console.error(error);
  }
};

const url = "https://books.toscrape.com";
testGet(url);

Note that I can, of course, access the contents of https://books.toscrape.com just fine in a regular web browser.

I've already made sure my package.json lists both axios and cheerio, like so:

{
  "type": "module",
  "dependencies": {
    "axios": "^1.6.8",
    "cheerio": "^1.0.0-rc.12"
  }
}

node -v:

v18.17.1

I've restarted the live server multiple times and checked for typos with the help of Copilot and ChatGPT, but nothing came up.

I reinstalled cheerio and manually added "type": "module" to the package.json file, so the script now works as intended when run directly in the terminal with node .\script.js, but it still doesn't work when run in the live-server.


Solution

  • Based on the discussion in the comments, you're trying to run scraping code from a website hosted on GitHub Pages. The problem is that most scraping happens on the backend, where requests aren't subject to CORS. CORS is a security policy enforced by the browser: it blocks a page from reading responses from other origins unless the target server explicitly allows it via response headers.

    Try this code, which is enough to show the problem:

    fetch("https://books.toscrape.com").catch(err => console.error(err));

    You should see something in your browser dev tools console like:

    Cross-Origin Request Blocked:
    The Same Origin Policy disallows reading the remote resource at https://books.toscrape.com/.
    (Reason: CORS header ‘Access-Control-Allow-Origin’ missing). Status code: 200.
    

    ...but no error in Node.

    Your options for dealing with this include:

    1. Hosting a server that can proxy the request for you. GH Pages doesn't have a backend, but you can do this with Express running on a host that does support one, like Glitch. You'd use your existing Axios + Cheerio code to make the request and extract the data you want in an Express route, then return the scraped data to your frontend dashboard. Your Express server would explicitly allow cross-origin requests from your GH Pages domain by setting the access control HTTP header:

      res.setHeader("Access-Control-Allow-Origin", "https://yourname.github.io");
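
      For example, a minimal sketch of such a proxy route (the /books path, the scraped fields, and yourname.github.io are placeholders for your own setup):

      import express from "express";
      import axios from "axios";
      import { load } from "cheerio";

      const app = express();

      app.get("/books", async (req, res) => {
        // Let the GH Pages frontend call this route cross-origin
        res.setHeader("Access-Control-Allow-Origin", "https://yourname.github.io");
        try {
          const response = await axios.get("https://books.toscrape.com");
          const $ = load(response.data);
          // Extract whatever your dashboard needs; the page title is just an example
          res.json({ title: $("title").text().trim() });
        } catch (error) {
          res.status(500).json({ error: error.message });
        }
      });

      app.listen(3000);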
      
    2. Using a GitHub Action to run your scraping code and write the data to a static file in your GitHub repo. This is a good technique if the data doesn't change often--you can run your action daily or weekly. See this blog post for a walkthrough of how you can set this up.
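
      As a rough sketch, the script such an Action runs could look like this (scrape.js and data.json are illustrative names, and the h3 a selector assumes the current books.toscrape.com markup):

      // scrape.js -- run on a schedule by the Action, which then commits
      // the refreshed data.json back to the repo
      import { writeFile } from "node:fs/promises";
      import axios from "axios";
      import { load } from "cheerio";

      const response = await axios.get("https://books.toscrape.com");
      const $ = load(response.data);

      // Each book link on the page is an <a> with its title attribute set
      const titles = $("h3 a")
        .map((i, el) => $(el).attr("title"))
        .get();

      await writeFile("data.json", JSON.stringify({ titles }, null, 2));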

    3. Using cors-anywhere to proxy the request without hosting your own backend. You can use the cors-anywhere demo server to test, but in the long run, you're supposed to host your own instance, which you can do for free on Render.com (at least at the time of writing):

    fetch("https://cors-anywhere.herokuapp.com/" + "https://books.toscrape.com")
      .then(response => response.text())
      .then(text => {
        const doc = new DOMParser().parseFromString(text, "text/html");
        const title = doc.querySelector("title").textContent.trim();
        console.log(title);
      })
      .catch(err => console.error(err));

    For this snippet to work, you'll need to go to http://cors-anywhere.herokuapp.com/corsdemo and click the button to get temporary access.

    Note that there's no real need for axios and cheerio in the frontend, only on the backend. The frontend already has fetch, jQuery, and the native DOM parser. Cheerio is a port of jQuery to Node, so bringing it back to the browser doesn't make sense, and axios has to be loaded as an extra dependency in the browser, so the slight syntactic convenience it offers doesn't compete with native fetch.
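
    For comparison, a minimal backend version of the same title extraction, reusing the axios + cheerio setup from the question:

    import axios from "axios";
    import { load } from "cheerio";

    const response = await axios.get("https://books.toscrape.com");
    const $ = load(response.data);
    console.log($("title").text().trim()); // same result, but no CORS restriction in Node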