Tags: sql, entity-framework-core, azure-sql-database

Is "not in list" faster than re-reading?


I have a map that displays the relevant events within the map's bounds. When the user zooms out, there are now more events to show, essentially in a larger box that includes the inner box already displayed.

So it first shows downtown Denver. Zoom out and it's now the Denver metro area. Zoom out some more and we're getting the rural areas around Denver. At this point, 95% of what is displayed in the larger box was already displayed in the smaller box.

The record being read has 72 columns: most are strings of moderate length, 7 are datetime, and about 10 are ints. The only byte[] is the RowVersion, and there's a geography column for the location on the map.

Which is faster when I read for a new, larger box, and also when the user scrolls sideways, moving the box so that the new box overlaps part of the old box?

Say I have 1,000 records already read, and the query will now return either 500 new records or 1,500 total records. Is it generally faster to do a "not in list" query, passing the list of 1,000 IDs (ints)? Or faster to read all of them, including re-reading the ones I already have?
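For concreteness, the two approaches I'm comparing would look roughly like this (names such as eventsOnMap, mapCenter, and radiusMiles are just illustrative; the spatial filter is the WhereWithinRange extension shown in the update below):

// Option A: exclude the 1,000 IDs already on the map ("not in list")
var loadedIds = eventsOnMap.Select(e => e.Id).ToList();
var newEvents = await context.Events
    .WhereWithinRange(mapCenter, radiusMiles)
    .Where(e => !loadedIds.Contains(e.Id))   // translates to WHERE Id NOT IN (...)
    .ToListAsync();

// Option B: re-read everything in the new, larger box, including the 1,000 I already have
var allEvents = await context.Events
    .WhereWithinRange(mapCenter, radiusMiles)
    .ToListAsync();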

Update: Per @Charlieface's comment, I have the event location as a column in the record. It is a NetTopologySuite.Geometries.Point and I use it as follows:

public static IQueryable<Event> WhereWithinRange(this IQueryable<Event> source, Point? location, int distance)
{
    // No map location to filter on; return the query unchanged.
    if (location == null)
        return source;

    // Keep events with no location on file, plus events within the given distance (in miles) of the map location.
    var meters = MiscUtils.MilesToMeters(distance);
    var query = source
        .Where(e =>
            e.Address.Location == null
            || e.Address.Location.Distance(location) < meters);
    return query;
}

This is Entity Framework on Azure SQL Database if that matters.

ps - Yes, I can test this myself, but there are lies, damned lies, and statistics. I'm asking here for someone who knows SQL performance a lot better than I do and has tested this accurately.


Solution

  • From a performance standpoint, the full retrieve will be as fast as or faster than adding an exclusion list, since the first part of the execution plan for an exclusion query still involves checking the indexes/tables for the data that needs to be considered in the first place. Where the potential time savings come in is the transmission and materialization of the resulting data: sending and constructing 1,000 entities/view models will be a bit faster than 1,500.

    However, one limitation to consider is that there is a limit to how many PKs you can pass into the resulting NOT IN (...) expression. For SQL Server that is around 32k values, while for Oracle it appears to be 1,000, so each DBMS likely has its own limit. In either case, with enough values the performance will likely degrade past any savings from transmission and materialization, and past a point it will result in an exception.

    To maximize performance when querying sets of data like this, my recommendations would be:

    #1. Leverage projection to minimize data transmission/materialization: To render a map or a summary list of locations, you almost certainly do not need 72 columns of data. Use Select() to cherry-pick just the columns you actually need, and fetch the full record by ID only when it is needed. The added benefit of projection is that the data coming back is not cached/tracked by the DbContext. Fetching entities means the DbContext, by default, adds each one to its change tracking cache, which bloats memory and impacts the performance of all subsequent queries. Projections are not cached. (A sketch is shown after this list.)

    #2. After establishing your projections, analyze the execution plan and tune the database indexing to ensure the query runs efficiently: As with any database system, a tip for working with EF is to inspect real-world execution and tune indexes based on which queries run frequently and/or across large swaths of data.

    #3. When fetching full entities by ID for read-only details, use AsNoTracking(): When you need read access to the entities themselves, such as a detail view for a selected entry, read them without adding them to the tracking cache. Developers facing performance issues are often led to the change tracking cache and opt to disable it entirely "for performance". The change tracking cache is extremely useful for editing and should not be disabled wholesale; instead, read operations should use projection as much as possible, and in the few cases where entire entities are needed read-only, use AsNoTracking() for that specific query. (See the sketch below.)
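    As a rough sketch of #1 and #3 (assuming an AppDbContext with an Events DbSet, an illustrative EventPin DTO, and an e.Name column; adjust the names to your actual model):

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.EntityFrameworkCore;
    using NetTopologySuite.Geometries;

    // Lightweight DTO for map pins - only the columns the map actually needs (#1).
    public record EventPin(int Id, string Name, Point? Location);

    public static class EventQueries
    {
        // #1: project to the DTO instead of materializing 72-column entities.
        // Projected results are not added to the DbContext change tracker.
        public static async Task<List<EventPin>> LoadPinsAsync(
            AppDbContext context, Point? mapCenter, int radiusMiles)
        {
            var query = context.Events
                .WhereWithinRange(mapCenter, radiusMiles)
                .Select(e => new EventPin(e.Id, e.Name, e.Address.Location));

            // For #2, you can capture the generated SQL and inspect its execution plan:
            // var sql = query.ToQueryString();

            return await query.ToListAsync();
        }

        // #3: when a full entity is needed read-only (e.g. a detail view),
        // fetch it by ID with AsNoTracking() so it bypasses the tracking cache.
        public static Task<Event?> LoadDetailAsync(AppDbContext context, int id)
        {
            return context.Events
                .AsNoTracking()
                .FirstOrDefaultAsync(e => e.Id == id);
        }
    }

    Here ToQueryString() (EF Core 5+) is handy for pulling the exact SQL into SSMS so you can check the plan and the supporting indexes, including the spatial index on the location column.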

    Loading a few thousand summary view models for map locations should prove no issue at all for EF and requires no complex plan for factoring in exclusions. As a general recommendation, I would avoid the temptation of premature optimization: start with the simplest thing, evaluate, and tune if and when there is a problem. The important thing is to test with real-world-scale data as early as possible, rather than trying to predict where problems might occur, which can lead to solutions in search of a problem and, in turn, to different, avoidable problems.