I'm attempting to create a scraper broken into two classes: a backend that scrapes a value from a website and returns it to another calling class, where for now it will just be printed. My problem is that I'm stuck when it comes to getting a value defined outside a tag, e.g. <div class="temp">13</div>.
Here is my backend so far. It takes a URL in the get route in case I want to add more classes that use it in the future:
const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const cors = require('cors')
const url = require("url");

app.use(cors())

app.get('/temp/:url1', (req, res) => {
    axios(req.params.url1) // the target URL arrives as a route parameter
        .then(response => {
            const html = response.data
            const $ = cheerio.load(html)
            const value = []
            // *stuck here*
        })
        .catch(err => console.log(err))
})

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`))
Here is my first app. It just calls fetch and prints the values:
const url1 = 'https://www.walmart.com/ip/Hind-Boys-Active-Shirts-Shorts-and-Jogger-Pants-8-Piece-Outfit-Set-Sizes-4-16/952146762?athcpid=952146762&athpgid=AthenaHomepageDesktop__gm__-1.0&athcgid=null&athznid=SeasonalCampaigns_d396fb61-c3c0-46db-a4d9-aaf34191b39f_items&athieid=null&athstid=CS020&athguid=kZNrXnatcjxcgUvbKkvbwYMT4bwAapwfOaos&athancid=null&athena=true&athbdg=L1400'
// (in this instance, the value I'm attempting to get is the "Now $24.97" portion)
fetch('http://localhost:8000/temp/' + url1)
    .then(response => response.json())
    .then(data => {
        console.log(data)
    })
    .catch(err => console.log(err))
To make it easier, here is the relevant HTML from the URL:
<span itemprop="price" aria-hidden="false">Now $24.97</span>
I see two options here. If the /temp (or /bids) route is supposed to handle arbitrary URLs, the second option makes more sense. But if the contract with the client is that they're all the same sort of URL, then you can (and probably should) do the scraping on the server:
const axios = require("axios");
const cheerio = require("cheerio");
const app = require("express")();

// A browser-like User-Agent header, since some sites reject the default axios one
const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";

app.get("/bids", (req, res) => {
    axios
        .get(req.query.url, {headers: {"User-Agent": ua}})
        .then(({data: html}) => {
            const $ = cheerio.load(html);
            // The price is the text content of the element with itemprop="price"
            res.json({price: $("[itemprop='price']").text().trim()});
        })
        .catch(err => res.status(400).json({error: "Bad request"}));
});

app.listen(8000, () => console.log("listening on port 8000"));
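For reference, the client-side call would then put the target URL in a query string rather than in the path. A rough sketch (the port and the product URL are the ones from the question; encodeURIComponent keeps the Walmart URL's own ? and & from being treated as part of this request's query string):

// The product URL from the question (its query-string portion omitted here for brevity)
const url1 = 'https://www.walmart.com/ip/Hind-Boys-Active-Shirts-Shorts-and-Jogger-Pants-8-Piece-Outfit-Set-Sizes-4-16/952146762'

fetch('http://localhost:8000/bids?url=' + encodeURIComponent(url1))
    .then(response => response.json())
    .then(data => console.log(data)) // e.g. { price: "Now $24.97" }
    .catch(err => console.log(err))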
A few things to note:

- Use req.query and pass a query string like ?url=https://www.example.com. req.params seems to be confused by paths like /bids/https://www.example.com. It's possible but maybe ugly? You could also accept a POST JSON payload with the URL.
- [itemprop="price"] is the CSS selector I'm using for your element. I'm nitpicking, but I'd say the value you want is inside the element rather than outside: it's the text content of the element (as opposed to an attribute, the foo="bar" pairs).

A general tip: try to decompose and minimize your problems and work on one at a time. Getting the URL to the server, making the request and parsing the HTML are all totally different steps. If you haven't validated that your URL parameter is coming through correctly, you might be confused if you're off working on selecting an element on a response that isn't what you think it is.
If you want to focus on Cheerio parsing, temporarily hardcode the HTML to simplify the problem space, avoiding bugs that could be elsewhere in the app getting in the way. Or hardcode the URL if you want to focus on making the request work.
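For example, a minimal standalone sketch of just the Cheerio step, using the HTML snippet from the question as the hardcoded input (run with node after installing cheerio):

const cheerio = require('cheerio')

// Hardcoded HTML from the question, so only the parsing step is being exercised
const html = '<span itemprop="price" aria-hidden="false">Now $24.97</span>'

const $ = cheerio.load(html)

// The value is the text content of the element, selected by its itemprop attribute
console.log($('[itemprop="price"]').text().trim()) // "Now $24.97"

Once that prints the expected string, the same selector can be dropped into the Express route above.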