I'm trying to scrape the whiskey name, image_url, and description from this site: https://www.thewhiskyexchange.com/c/33/american-whiskey?filter=true#productlist-filter using cheerio.js. I want to turn that information into an array of JSON objects to store in my MongoDB. I can't show the entire HTML of the site, but here is the relevant basic structure of the unordered list:
<body>
  <div class="siteWrapper">
    <div class="wrapper">
      <div class="products-wrapper">
        <ul class="products-list">
          <li>
            <a>
              <div class="product-content">
                <div class="information">
                  <p class="name">
                    " Jack Daniel's Old No. 7"
                    <span>Small Bottle</span>
                  </p>
                </div>
              </div>
            </a>
          </li>
          <li></li>
          <li></li> etc. </all closing tags>
Starting off just attempting to get the whiskey name in <p class="name">, without any text from <span> tags, I used this jQuery code in the browser console and it gets me exactly what I need:
$('ul.products-list > li').each(function(index) {
  const nameOnly = $(this).find('a div div.information p.name').first().contents().filter(function() {
    return this.nodeType == 3;
  }).text();
  const whiskeyObject = {name: nameOnly};
  const whiskeys = JSON.stringify(whiskeyObject);
  console.log(whiskeys);
})
Trying the same code in my app file (whiskey-scraper.js) with cheerio:
const express = require('express');
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');
const app = express();
const port = 8000;

request('https://www.thewhiskyexchange.com/c/33/american-whiskey?filter=true#productlist-filter', function(error, response, body) {
  if (error) {
    console.log("Error: " + error);
    return; // response is undefined when the request fails, so stop here
  }
  console.log("Status code: " + response.statusCode);
  const $ = cheerio.load(body);
  // console.log(body);
  $('ul.products-list > li').each(function(index) {
    // Keep only the direct text nodes of <p class="name"> (nodeType 3), skipping the <span>
    const nameOnly = $(this).find('a div div.information p.name').first().contents().filter(function() {
      return this.nodeType === 3;
    }).text().trim();
    const whiskeyObject = {name: nameOnly};
    const whiskeys = JSON.stringify(whiskeyObject);
    console.log(whiskeys);
  });
});

app.listen(port);
console.log(`Stuff is working on Port ${port}!`);
When I run node inspect whiskey-scraper.js in my terminal, the console logs a status code of 200, but also logs this error:
"Error: Can only perform operation while paused. - undefined
at _pending.(anonymous function) (node-inspect/lib/internal/inspect_client.js:243:27)
at Client._handleChunk (node-inspect/lib/internal/inspect_client.js:213:11)
at emitOne (events.js:96:13)
at Socket.emit (events.js:191:7)
at readableAddChunk (_stream_readable.js:178:18)
at Socket.Readable.push (_stream_readable.js:136:10)
at TCP.onread (net.js:561:20)"
Can't figure out what this means or how to work around this error. Any ideas on how to eliminate this error and at least get my console.log(whiskeys); line working? If I can get that working, I can take it from there.
When I uncomment console.log(body); the entire HTML for the site gets logged to the console, so I believe cheerio is getting the information I need from the site. Once I eliminate this error, I can figure out getting the image_url and the description, and getting everything into my MongoDB.
Thank you!
Figured out the solution for this. On the website, you can display whiskeys and their information in either a grid format or a list format, and both use the exact same URL. I was looking at the HTML for the list format, which uses the <ul><li> structure, but the HTML that request fetches and cheerio parses is the grid format, where there is no unordered list, just multiple nested <div>s. Never even thought of that!
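A quick way to catch this kind of mismatch is to count how many elements each selector matches in the fetched body before writing the loop. What follows is only a minimal sketch, assuming `$` is the function returned by cheerio.load(body). The helper names firstMatchingSelector and countMatches are made up for illustration, and the grid selector `div.product-grid-item` is a hypothetical placeholder: the real class names have to be read from the logged body.

```javascript
// Sketch only: `$` is any cheerio-style function, e.g. the result of cheerio.load(body).

// Return the first selector that matches at least one element, or null if none do.
function firstMatchingSelector($, candidates) {
  const hit = candidates.find((selector) => $(selector).length > 0);
  return hit === undefined ? null : hit;
}

// Report how many elements each candidate selector matches.
function countMatches($, candidates) {
  return candidates.map((selector) => ({ selector, count: $(selector).length }));
}

const candidates = [
  'ul.products-list > li',  // list view, taken from the markup in the question
  'div.product-grid-item',  // HYPOTHETICAL grid selector; replace after inspecting the body
];

// Usage inside the request callback (assumes cheerio is installed):
//   const $ = cheerio.load(body);
//   console.log(countMatches($, candidates));
//   console.log(firstMatchingSelector($, candidates));
```

A count of 0 for 'ul.products-list > li' alongside a 200 status code suggests that the fetched page simply does not contain the list markup seen in the browser.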