Tags: sql, database, nosql, screen-scraping, web-crawler

What database for crawler/scraper?


I am currently researching what database to use for a project I am working on. Hopefully you guys can give me some hints.

The project is an automated web crawler that checks websites as per a user's request, scrapes data under certain circumstances, and creates log files of what was done.

Requirements:

  • Only a few tables with few columns; predefining columns is no problem
  • No overly complex associations between models
  • A huge number of date- and time-based queries
  • Due to logging, the database will grow rapidly and use up a lot of space
  • Should be able to scale over multiple servers
  • Fields contain mostly ids (int), strings (around 200-500 characters max), and unix timestamps
  • Two different types of servers will simultaneously read from and write to it directly:
    • One (later more) Rails app that takes user input and displays results upon request
    • One (later more) Node.js server that functions as the executing crawler/scraper. It will have enough load to run continuously and make dozens of database queries every second.

I assume it will be neither a graph database (no complex associations) nor a memory-based key/value store (too much data to hold in cache). I'm still on the fence about every other type of database I could find; each seems to have its merits.

So, any advice from the pros on how I should decide?


Solution

  • I agree with Vladimir that you should consider a document-based database for this scenario. I am most familiar with MongoDB. My reasons for recommending it here are as follows:

    1. Your 'schema requirements' of "only a few tables with few columns" fit well with the NoSQL nature of MongoDB.
    2. Same as above for "no overly complex associations between models" -- you will want to decide whether you'd prefer nested documents or using dbref (I prefer the former)
    3. Huge amount of time-based data (and other scaling requirements) - MongoDB scales horizontally via sharding, i.e. partitioning a collection across multiple servers
    4. Read/write access - this is why I am recommending MongoDB over something like Hadoop. Both of your servers need to read and write interactively, and a Hadoop-style store is designed for batch processing rather than interactive queries.
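
To make points 2 and 3 more concrete, here is a minimal sketch of what a nested log document and a time-range query could look like. The field names (`user_id`, `url`, `checked_at`, `result`) and the helper functions are hypothetical and not taken from the question; the filter uses MongoDB's standard `$gte`/`$lt` operators. The code only builds plain dictionaries, so it runs without a database.

```python
def make_crawl_log(user_id: int, url: str, status: int, scraped: dict, ts: int) -> dict:
    """Build one log document (hypothetical field names)."""
    return {
        "user_id": user_id,
        "url": url,
        "checked_at": ts,  # unix timestamp, as described in the question
        "result": {        # nested document instead of a DBRef to another collection
            "http_status": status,
            "scraped": scraped,
        },
    }


def time_range_filter(user_id: int, start: int, end: int) -> dict:
    """MongoDB-style filter: one user's logs with start <= checked_at < end."""
    return {"user_id": user_id, "checked_at": {"$gte": start, "$lt": end}}


# Assumed compound index to support the filter above: (field, direction) pairs.
LOG_INDEX = [("user_id", 1), ("checked_at", -1)]

doc = make_crawl_log(7, "http://example.com", 200, {"title": "Example"}, 1700000000)
query = time_range_filter(7, 1699990000, 1700010000)
```

With a driver such as PyMongo, these values would typically be passed to `collection.insert_one(doc)`, `collection.find(query)`, and `collection.create_index(LOG_INDEX)`. Whether `user_id` plus `checked_at` is the right index (or shard key) depends on your actual query patterns, so treat it as a starting point only.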