Let us say I have the following set of classes :
public class MegaBookCorporation
{
public int ID { get; private set}
public int BooksInStock
{
get
{
return Stores.Sum( x => x.BooksInStock)
}
}
public virtual ICollection<MegaBookCorporationStore> Stores { get; set; }
}
public class MegaBookCorporationStore
{
public int ID { get; private set; }
public string BookStoreName { get; private get; }
public virtual MegaBookCorporation ManagingCorporation { get; private set;}
public int BooksInStock
{
get
{
return Books.Where( x=> !x.IsSold).Count();
}
}
public virtual ICollection<Book> Books { get; set; }
}
public class Book
{
public int IndividualBookTrackerID { get; private set; }
public virtual MegaBookCorporationStore { get; private set; }
public bool IsSold { get; private set; }
public DateTime? SellingDate { get; private set;}
}
I had a discussion at work regarding the performance hit involved when retrieving the NumberOfBooks in a MegaBookCorporation. Two important facts :
1/ We're using EF 6 with Lazy Loading as suggested by the virtual keywords.
2/ Since every book is tracked individually the number of Book entries in the database will become great quickly. The table will likely have a size of hundreds of millions on the long run. We will perhaps be adding up to 100,000 books per day.
The opinion I supported is that the current implementation is fine and that we're not going to run into problems. My understanding is that a SQL statement would be generated to filter the collection when GetEnumerator is called.
The other suggestion made by my coworker is to cache the number of books. That means updating a field "int ComputedNumberOfBooks" whenever the AddBookToStock() or SellBook() methods would be called. This field would need to be repeated and updated in both the Store and Corporation classes. (Then of course we would need to take care of concurrency)
I know adding these fields wouldn't be a big deal, but I really feel bad about this idea. To me it looks like pre-engineering a problem that doesn't exist, and that in my opinion won't exist.
I decided to check again my claims with SO and found 2 contradicting answers :
One saying that the whole Books collection would be pulled to memory, since ICollection only inherits from IEnumerable. The other saying the opposite : the navigation property will be treated as an IQueryable until it is evaluated.(Why not since the property is wrapped by a proxy)
So here are my questions :
1- What is the truth ?
2- Even if the whole collection is referenced, don't you think that it's not a big deal since it would be an IEnumerable (low memory usage).
3- What do you think of the memory consumption / performance hit on this example, and what would be the best way to go ?
Thank you
The truth is that with the properties you defined the whole collection of books is loaded. Here's why.
Ideally, you want to be able to do
var numberOfBooks = context.MegaBookCorporations
.Where(m => m.ID == someId)
.Select(m => m.BooksInStock)
.Single();
If EF would be able to turn this into SQL, you'd have a query that only returns an integer and loads no entities into memory whatsoever.
But, unfortunately, EF can't do this. It will throw an exception that there is no SQL translation for BooksInStock
.
To circumvent this exception you could do:
var numberOfBooks = context.MegaBookCorporations
.Where(m => m.ID == someId)
.Single()
.BooksInStock;
This dramatically changes things. Single()
draws one MegaBookCorporation
into memory. Accessing its BooksInStock
property triggers lazy loading of MegaBookCorporation.Stores
. Subsequently, for each Store
the complete Books
collections are loaded. Finally, the LINQ operations (x => !x.IsSold
, Count
, Sum
) are applied in memory.
So in this case, the first link is correct. Lazy loading always loads complete collections. Once the collections are loaded, they will not be loaded again.
But the second link is correct too :).
As long as you manage to do everything in one LINQ statement that can be translated into SQL, the navigation properties and predicates will be evaluated in the database and no lazy loading will occur. But then you can't use the BooksInStock
properties.
The only way to achieve this is by a LINQ statement like
var numberOfBooks = context.MegaBookCorporations
.Where(m => m.ID == someId)
.SelectMany(m => m.Stores)
.SelectMany(s => s.Books)
.Count();
This executes a pretty efficient query with one join and a COUNT
, returning only the count.
So unfortunately, your key assumption...
that a SQL statement would be generated to filter the collection when GetEnumerator is called.
Is not entirely correct. A SQL statement is generated, but not including the filter. With the numbers of books you mention this will cause severe performance and memory problems.
Something should be done if you need these counts frequently and you don't want to query them separately all the time. Your coworker's idea, a redundant ComputedNumberOfBooks
field in the database could be a solution, but I share your objections.
Redundancy should be avoided at (nearly) all costs. The worst part is that it always requires a client application to keep both sides in sync. Or database triggers.
But talking about the database... If these counts are important and frequently queried, I would introduce a computed column BooksInStock
in the MegaBookCorporationStore
table. Its formula could simply do the count of books in store. Then you can add this computed column to your entity as a property that is marked as DatabaseGeneratedOption.Computed
. No redundancy.