
How to check from Mongoose timestamps if the document already exists?


I'm building a web scraper with Node.js, Puppeteer, and Mongoose. I can scrape the data from the web page and save it to the database. The next step is to check whether a document already exists in the database. I've been searching and trying many approaches without success. Here is the part of my code that saves the data to the database:

try {
  const newCar = new Car({
    make: make,
    model: model,
    year: year,
    km: km,
    price: price
  });

  let saveCar = await newCar.save();
  console.log(saveCar);
  console.log('car saved!');
} catch (err) {
  console.log('err' + err);
}

In my Schema, I've added the timestamps option:

const mongoose = require('mongoose');

const Schema = mongoose.Schema;

const carSchema = new Schema({
  make: {
    type: String
  },
  model: {
    type: String
  },
  year: {
    type: String
  },
  km: {
    type: String
  },
  price: String

}, {timestamps: true });

module.exports = mongoose.model('Car', carSchema);

I hope someone can point me in the right direction with this. Is there a way to use the createdAt timestamp to check whether a document is already in the database, and skip it when scraping?

EDIT: I've been trying to solve this using the hash approach. This is my code:

const hash = md5(assetsUrl);
const existingCar = Car.find({
  'hash': { $exists: true }
});

if (!existingCar) {
  try {
    const newCar = new Car({
      make: make,
      model: model,
      year: year,
      km: kmInt,
      price: priceInt,
      currency: currencyString,
      carUrl: carUrl,
      imageUrl: imageUrls,
      hash: hash
    });

    let saveCar = await newCar.save();
    console.log(saveCar);
    console.log('car saved!');
  } catch (err) {
    console.log('err' + err);
  }
} else {
  console.log('car already in db');
}

This doesn't work: the code falls through to the else block every time. What am I missing here?


Solution

  • There are several possible ways to handle your case:
    1. Create a unique index on the record, which enforces uniqueness at the database level. In your case, this means you can skip the additional logic and keep parsing already-saved documents, because no data will be duplicated.
    2. Create a hash of the page every time you visit it, and store the hash in the database. In your case, you can create the hash on the first visit and, on later visits, compare the page's current hash with the one stored in the database. If the content has changed, parse the page; if it hasn't, skip it.
    3. If you just want to verify that the same data is not already in the database and you don't want to add a unique index, you first have to run findOne for that data.
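
As for why the code in the EDIT always reaches the else block: Car.find(...) is not awaited, so existingCar is a Query object, which is always truthy. Even when awaited, find resolves to an array (truthy even when empty), and { $exists: true } matches any document that has a hash field at all, not the hash of the current page. A minimal sketch of options 2 and 3 combined might look like the following. The helper names hashOf and saveIfNew are hypothetical, Node's built-in crypto module stands in for the md5 package used in the question, and the duplicate-key branch only matters if you also declare a unique index on hash (option 1), which the schema in the question does not do.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive a stable fingerprint from the listing URL.
// Node's built-in crypto is used here instead of the md5 package.
function hashOf(url) {
  return crypto.createHash('md5').update(url).digest('hex');
}

// Saves the car only if no document with the same hash exists yet.
// `Car` is assumed to be a Mongoose model whose schema has a `hash` field.
async function saveIfNew(Car, carData, sourceUrl) {
  const hash = hashOf(sourceUrl);

  // Query for THIS hash and await the result. findOne resolves to the
  // matching document, or null when there is none.
  const existing = await Car.findOne({ hash });
  if (existing) {
    return { saved: false, doc: existing };
  }

  try {
    const doc = await Car.create({ ...carData, hash });
    return { saved: true, doc };
  } catch (err) {
    // 11000 is MongoDB's duplicate-key error code. It can only occur here
    // if `hash` has a unique index (an assumption, see option 1) and
    // another save inserted the same hash after the findOne above.
    if (err && err.code === 11000) {
      return { saved: false, doc: null };
    }
    throw err;
  }
}
```

Inside the scraping loop this could be called as const result = await saveIfNew(Car, { make, model, year, km, price }, carUrl); and the page skipped whenever result.saved is false.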