
Denormalizing and querying a giant Couchbase data set


Let's say I have about 1 TB of SQL email data with the following fields:

email, email_host, username, first_name, last_name, last_login

I'm told that MySQL queries like SELECT * FROM emails WHERE email_host = 'gmail.com' are running slow... we're talking 30 minutes or more.

I'd like to try out Couchbase, but I'm not sure how to go about denormalizing the data and building the views. It seems that one possible approach, per Couchbase's own video tutorials, is to create a "key" like:

u::gmail::incremented_id

or

u::john::incremented_id

then fetch only the u::gmail keys to get a list of keys corresponding to Gmail addresses, instead of scanning all of the documents and checking whether the email_host attribute is gmail.com.
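For illustration, a rough sketch of what storing a user under such a key might look like with the Couchbase Node.js SDK (2.x API). The bucket name "emails", the counter document "user::counter", and the sample field values are all assumptions, not part of the actual data set:

    var couchbase = require('couchbase');

    var cluster = new couchbase.Cluster('couchbase://localhost');
    var bucket = cluster.openBucket('emails');

    // Get the next id from a counter document, then store the user under a
    // composite key such as u::gmail::42.
    bucket.counter('user::counter', 1, { initial: 1 }, function (err, res) {
      if (err) throw err;

      var key = 'u::gmail::' + res.value;
      bucket.upsert(key, {
        type: 'user',                    // used later to filter documents in views
        email: 'john@gmail.com',
        email_host: 'gmail.com',
        username: 'john',
        first_name: 'John',
        last_name: 'Doe',
        last_login: '2014-01-01T00:00:00Z'
      }, function (err) {
        if (err) throw err;
        console.log('stored', key);
      });
    });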

Is this a legitimate / good strategy for denormalizing this sample data set, and if so, what is the proper technique for creating a view to preindex u::whatever?


Solution

  • You just need a map function like this:

    function (doc) {
      // index only user documents, keyed by their email host
      if (doc.type === 'user') {
        // emitting the whole doc as the value avoids a second fetch,
        // at the cost of a larger index
        emit(doc.email_host, doc);
      }
    }
    

    Then you query the view with the key gmail.com to get all the users on Gmail. If the bucket contains other document types, you should include a type attribute in your documents to distinguish them. A query sketch follows below.
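
    Assuming the map function above lives in a design document named users as a view named by_email_host (both names are assumptions, as is the bucket name emails), querying it from the Node.js SDK 2.x might look roughly like this:

        var couchbase = require('couchbase');
        var ViewQuery = couchbase.ViewQuery;

        var cluster = new couchbase.Cluster('couchbase://localhost');
        var bucket = cluster.openBucket('emails');

        // Ask the 'by_email_host' view in the 'users' design document for
        // rows whose emitted key is exactly 'gmail.com'.
        var query = ViewQuery.from('users', 'by_email_host')
          .key('gmail.com')
          .stale(ViewQuery.Update.BEFORE);   // wait for the index to catch up before returning

        bucket.query(query, function (err, rows) {
          if (err) throw err;
          rows.forEach(function (row) {
            // row.id is the document key, row.value is the emitted document
            console.log(row.id, row.value.email);
          });
        });

    Because the map function emits the whole document as the value, each row already carries the user's fields; emitting null instead and fetching documents by row.id is a common alternative that keeps the index smaller.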