Search code examples
sqloptimizationlanguage-agnosticdata-retrieval

How to avoid many database round trips and a lot of irrelevant data?


I have worked with various applications and encountered this situation many times. Until now I have not figured out what is the best approach.

Here's the scenario:

  • I have an application either desktop or web
  • I need to retrieve simple documents from the database. The document has a general details and item details so the database:

GeneralDetails table:

| DocumentID | DateCreated | Owner     |
| 1          | 07/07/07    | Naruto    |
| 2          | 08/08/08    | Goku      |
| 3          | 09/09/09    | Taguro    |

ItemDetails table

| DocumentID | Item        | Quantity  |
| 1          | Marbles     | 20        |
| 1          | Cards       | 56        |
| 2          | Yo-yo       | 1         |
| 2          | Chess board | 3         |
| 2          | GI Joe      | 12        |
| 3          | Rubber Duck | 1         |

As you can see, the tables have a one-to-many relationship. Now, in order to retrieve all the documents and their respective items, I always do either of the two:

Method 1 - Many round trips (pseudo-code):

 Documents = GetFromDB("select DocumentID, Owner " +
                       "from GeneralDetails") 
 For Each Document in Documents
{
    Display(Document["CreatedBy"])
    DocumentItems = GetFromDB("select Item, Quantity " + 
                              "from ItemDetails " + 
                              "where DocumentID = " + Document["DocumentID"] + "")
    For Each DocumentItem in DocumentItems
    {
        Display(DocumentItem["Item"] + " " + DocumentItem["Quantity"])
    }
}

Method 2 - Much irrelevant data (pseudo-code):

DocumentsAndItems = GetFromDB("select g.DocumentID, g.Owner, i.Item, i.Quantity " + 
                              "from GeneralDetails as g " +
                              "inner join ItemDetails as i " +
                              "on g.DocumentID = i.DocumentID")
//Display...

I used the first method when I was in college for desktop applications, the performance was not bad so I realized it was okay.

Until one day, I saw an article "Make the web faster", it says that many round trips to the database is bad; so ever since then I have used the second method.

On the second method, I avoided round trips by using inner join to retrieve the first and the second table at once, but it produces unecessary or redundant data. See the result set.

| DocumentID | Owner     | Item        | Quantity  |
| 1          | Naruto    | Marbles     | 20        |
| 1          | Naruto    | Cards       | 56        |
| 2          | Goku      | Yo-yo       | 1         |
| 2          | Goku      | Chess board | 3         |
| 2          | Goku      | GI Joe      | 12        |
| 3          | Taguro    | Rubber Duck | 1         |

The result set has redundant DocumentID and Owner. It looks like an unnormalized database.

Now, the question is, how do I avoid round trips and at the same time avoid redundant data?


Solution

  • The method used by ActiveRecord and other ORMs is to select the first table, batch together the IDs and then use those IDs in an IN clause for the second select.

    SELECT * FROM ItemDetails WHERE DocumentId IN ( [Comma Separated List of IDs here] )

    Advantages:

    • No redundant data

    Disadvantages:

    • Two queries

    Generally speaking, the first method is referred to as the "N+1 query problem" and the solutions are referred to as "eager loading". I tend to see your "Method 2" as preferable as the latency to the database generally trumps the size of the redundant data over the data transfer rate, but YRMV. As with almost everything in software, it's a tradeoff.