Tags: javascript, jquery, node.js, screen-scraping, cheerio

Scraping with cheerio.js, getting: Error: Can only perform operation while paused


I'm trying to scrape the whiskey name, image_url, and description from this site: https://www.thewhiskyexchange.com/c/33/american-whiskey?filter=true#productlist-filter using cheerio.js. I want to turn that information into an array of JSON objects to store in MongoDB. I can't show the entire HTML of the site, but here is the relevant portion of the unordered list's basic structure:

<body>
  <div class="siteWrapper">
    <div class="wrapper">
      <div class="products-wrapper">
        <ul class="products-list">
          <li>
            <a>
              <div class="product-content">
                <div class="information">
                  <p class="name">
                    " Jack Daniel's Old No. 7"
                      <span>Small Bottle</span>
                  </p>
                </div>
              </div>
            </a>
          </li>
          <li></li>
          <li></li>
          <!-- etc., followed by all closing tags -->

To start, I'm only trying to get the whiskey name in <p class="name">, without any text from the <span> tags. This jQuery code, run in the browser console, gets me exactly what I need:

$('ul.products-list > li').each(function(index) {
    const nameOnly = $(this).find('a div div.information p.name').first().contents().filter(function() {
        return this.nodeType == 3;
    }).text();
    const whiskeyObject = {name: nameOnly};
    const whiskeys = JSON.stringify(whiskeyObject);
    console.log(whiskeys);
})

Trying the same code in my app file (whiskey-scraper.js) with cheerio:

const express = require('express');
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');
const app = express();
const port = 8000;

request('https://www.thewhiskyexchange.com/c/33/american-whiskey?filter=true#productlist-filter', function(error, response, body) {
  if (error) {
    console.log("Error: " + error);
    return; // response may be undefined when the request fails, so stop here
  }
  console.log("Status code: " + response.statusCode);

  const $ = cheerio.load(body);
  // console.log(body);
  $('ul.products-list > li').each(function(index) {
    const nameOnly = $(this).find('a div div.information p.name').first().contents().filter(function() {
      return this.nodeType == 3;
    }).text().trim();
    const whiskeyObject = {name: nameOnly};
    const whiskeys = JSON.stringify(whiskeyObject);
    console.log(whiskeys);
  });
});

app.listen(port);
console.log(`Stuff is working on Port ${port}!`);

When I run node inspect whiskey-scraper.js in my terminal, the console logs a status code of 200, but also logs this error:

"Error: Can only perform operation while paused. - undefined
  at _pending.(anonymous function) (node-inspect/lib/internal/inspect_client.js:243:27)
  at Client._handleChunk (node-inspect/lib/internal/inspect_client.js:213:11)
  at emitOne (events.js:96:13)
  at Socket.emit (events.js:191:7)
  at readableAddChunk (_stream_readable.js:178:18)
  at Socket.Readable.push (_stream_readable.js:136:10)
  at TCP.onread (net.js:561:20)"

I can't figure out what this means or how to work around it. Any ideas on how to eliminate this error and at least get my console.log(whiskeys); line working? If I can get that working, I can take it from there.

When I uncomment console.log(body); the entire HTML for the site gets logged to the console, so I believe cheerio is getting the information I need from the site. Once I eliminate this error, I can figure out how to get the image_url and the description, and how to get everything into MongoDB.

Thank you!


Solution

  • Figured out the solution for this. The website can display whiskeys and their information in either a grid format or a list format, and both use the exact same URL. I was looking at the HTML for the list format, which uses <ul><li> markup, but the HTML that request fetched and cheerio loaded was the grid format, which has no unordered list, just multiple nested <div>s. That means my 'ul.products-list > li' selector matched nothing. Never even thought of that!