Getting Value outside tag with web scraper JavaScript


I'm attempting to create a scraper broken into two classes. One is a backend that will scrape a value from a website and return it to another calling class, where for now it'll just be printed. My problem is that I'm stuck on getting a value defined outside a tag, i.e. the 13 in <div class="temp">13</div>

Here is my backend so far. It takes a URL in the get function, in case I want to add more classes that use it in the future:

const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const cors = require('cors')
const url = require("url");
app.use(cors())

app.get('/temp/:url1', (req, res) => {
    axios(url1)
        .then(response => {
            const html = response.data
            const $ = cheerio.load(html)
            const value = []
            
            // *stuck here*
          
        }).catch(err => console.log(err))

})

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`))

Here is my first app. It only calls fetch and prints the values:

url1 = 'https://www.walmart.com/ip/Hind-Boys-Active-Shirts-Shorts-and-Jogger-Pants-8-Piece-Outfit-Set-Sizes-4-16/952146762?athcpid=952146762&athpgid=AthenaHomepageDesktop__gm__-1.0&athcgid=null&athznid=SeasonalCampaigns_d396fb61-c3c0-46db-a4d9-aaf34191b39f_items&athieid=null&athstid=CS020&athguid=kZNrXnatcjxcgUvbKkvbwYMT4bwAapwfOaos&athancid=null&athena=true&athbdg=L1400'
// (in this instance, the value I'm attempting to get is the "Now $24.97" portion)
fetch('http://localhost:8000/temp/' + url1)
    .then(response => {return response.json()})
    .then(data => {
        console.log(data)
    })
    .catch(err => console.log(err))

To make it easier, here is the relevant HTML from the URL:

<span itemprop="price" aria-hidden="false">Now $24.97</span>

Solution

  • I see two options here:

    • The server parses the page and returns the price to the client
    • The server passes the HTML response to the client, who is responsible for parsing

    If the /temp (or /bids) route is supposed to handle arbitrary URLs, the second option makes more sense. But if the contract with the client is that they're all the same sort of URL, then you can (and probably should) do the scraping on the server:

    const axios = require("axios");
    const cheerio = require("cheerio");
    const app = require("express")();
    
    const ua =
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
    
    app.get("/bids", (req, res) => {
      axios
        .get(req.query.url, {headers: {"User-Agent": ua}})
        .then(({data: html}) => {
          const $ = cheerio.load(html);
          res.json({price: $("[itemprop='price']").text().trim()});
        })
        .catch(err => res.status(400).json({error: "Bad request"}));
    });
    
    app.listen(8000, () => console.log("listening on port 8000"));
    

    A few things to note:

    • You can use req.query and pass a query string like ?url=https://www.example.com. req.params is awkward for paths like /bids/https://www.example.com, because the slashes in the embedded URL are read as path separators. It's possible if the URL is percent-encoded, but maybe ugly. You could also accept a POST JSON payload with the URL.
    • I'm using a user agent string to (help) avoid blocking.
    • [itemprop="price"] is the CSS selector I'm using for your element. I'm nitpicking, but I'd say the value you want is inside the element rather than outside. It's the text content of the element (as opposed to an attribute, the foo="bar" pairs).
    • I'm not doing much in the way of error handling here (every failure becomes a generic 400), but this is important in a real app.
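
    On the client side, the query string approach from the first note could look roughly like the sketch below. buildBidsUrl is a hypothetical helper name, and the example.com address is a placeholder rather than the real product URL; the only point is that the target URL gets percent-encoded before it is attached as ?url=...

```javascript
// Hypothetical client-side helper: builds the request URL for the /bids route.
function buildBidsUrl(base, targetUrl) {
  const endpoint = new URL("/bids", base);
  // searchParams.set percent-encodes the value, so "?", "&" and "/" inside
  // targetUrl cannot be mistaken for parts of the outer URL.
  endpoint.searchParams.set("url", targetUrl);
  return endpoint.toString();
}

// Placeholder target URL with its own query string, to show why encoding matters.
const targetUrl = "https://www.example.com/item?id=1&ref=home";
const requestUrl = buildBidsUrl("http://localhost:8000", targetUrl);
console.log(requestUrl);
```

    The resulting string can be passed straight to fetch(requestUrl), and on the server req.query.url should then hold the original, decoded URL.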

    A general tip: try to decompose and minimize your problems and work on one at a time. Getting the URL to the server, making the request, and parsing the HTML are all totally different steps. If you haven't validated that your URL parameter is coming through correctly, you can end up debugging a selector against a response that isn't what you think it is.
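
    As one way to check the first step in isolation, the server could validate the incoming parameter before making any request. This is only a sketch: parseTargetUrl is a hypothetical helper, and restricting input to http/https is an assumption about what the route should accept.

```javascript
// Hypothetical helper: returns a parsed URL for http(s) input, or null otherwise.
function parseTargetUrl(raw) {
  // req.query values can be missing or arrays, so require a plain string.
  if (typeof raw !== "string") return null;
  try {
    const parsed = new URL(raw);
    return ["http:", "https:"].includes(parsed.protocol) ? parsed : null;
  } catch {
    return null; // new URL() throws a TypeError on input it cannot parse
  }
}

console.log(parseTargetUrl("https://www.example.com/item?id=1")?.href);
console.log(parseTargetUrl("not a url"));
console.log(parseTargetUrl(undefined));
```

    In the route handler, a null result from parseTargetUrl(req.query.url) would be the place to respond with a 400 before axios is ever called.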

    If you want to focus on Cheerio parsing, temporarily hardcode the HTML to simplify the problem space, avoiding bugs that could be elsewhere in the app getting in the way. Or hardcode the URL if you want to focus on making the request work.