Tags: c#, sql, sqldatareader, yield-return, premature-optimization

Take(10) vs TOP 10 With SqlDataReader?


I have a method that reads data using SqlDataReader and yield returns an IEnumerable, e.g.:

IEnumerable<string> LoadCustomers()
{
    using (SqlDataReader rdr = cmd.ExecuteReader())
    {
        while (rdr.Read())
        {
            yield return rdr.GetString(0);
        }
    }
}

Now let's assume I need only the newest 10 customers. I could do

LoadCustomers().Take(10)

or pass 10 as a parameter to SQL and make my query

SELECT TOP 10 * FROM Customers ORDER BY CreationDate DESC

According to this post, the entire result set is transmitted from SQL Server to the client even if the data reader reads only a few rows (as long as the connection stays open). Should I avoid the Take(10) approach because of that extra data being transmitted, or would avoiding it be premature optimization (since the yield return code closes the connection after reading 10 rows, at which point the data transmission stops anyway)?


Solution

  • Since whether or not optimization is "premature" is subjective, I choose to interpret this question as "does using a DataReader and stopping the read after 10 rows have the same performance characteristics as using TOP(10) in the query?"

    The answer is no. Passing TOP(10) to the server allows the optimizer to tune reads, memory grants, I/O buffers, lock granularity and parallelism with the knowledge that the query will return (and in this case, also read) at most 10 rows. Leaving out the TOP means it has to prepare for the case where the client is going to be reading all rows -- regardless of whether you actually stop earlier.
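A minimal sketch of the TOP approach follows, passing the row count as a query parameter rather than concatenating it into the SQL; TOP is parameterizable in SQL Server when parenthesized. The connection string, table, and column names ("Customers", "Name", "CreationDate") are illustrative assumptions, not from the original post:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class CustomerRepository
{
    private readonly string _connectionString;

    public CustomerRepository(string connectionString)
    {
        _connectionString = connectionString;
    }

    // TOP (@count) lets the optimizer plan for at most 'count' rows,
    // without building the SQL text dynamically.
    public IEnumerable<string> LoadNewestCustomers(int count)
    {
        using (var conn = new SqlConnection(_connectionString))
        using (var cmd = new SqlCommand(
            "SELECT TOP (@count) Name FROM Customers ORDER BY CreationDate DESC",
            conn))
        {
            cmd.Parameters.Add("@count", SqlDbType.Int).Value = count;
            conn.Open();
            using (SqlDataReader rdr = cmd.ExecuteReader())
            {
                while (rdr.Read())
                {
                    yield return rdr.GetString(0);
                }
            }
        }
    }
}
```

Because this is still an iterator method, the connection and reader are only opened when enumeration starts and are disposed when the caller finishes (or abandons) the enumeration.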

    It isn't true that the server will send rows whether you read them or not. Pulling rows with a SqlDataReader is conceptually a row-by-row operation: when you call rdr.Read(), you fetch the next row from the server, and only that row. But in the interest of performance, rows are buffered before you request them (both on the server end and in the network buffers). So it is possible to end up with, say, 100 rows retrieved into buffers after your first Read() call, even if you only read 10 of them.
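The disposal side of the Take(10) approach can be demonstrated without a database: when Take stops, it disposes the source enumerator, which runs the iterator's using/finally block, and in LoadCustomers that is what closes the reader. A self-contained sketch (the finally block here stands in for the using(SqlDataReader) block):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Stands in for LoadCustomers: the finally block plays the role of
    // the using(SqlDataReader) block, "closing the reader" on disposal.
    static IEnumerable<int> Rows()
    {
        try
        {
            for (int i = 0; ; i++)      // "endless" result set
            {
                yield return i;
            }
        }
        finally
        {
            Console.WriteLine("reader closed");
        }
    }

    static void Main()
    {
        foreach (int row in Rows().Take(3))
        {
            Console.WriteLine(row);     // prints 0, 1, 2
        }
        // Take(3) disposed the source enumerator after the third row,
        // so the finally block ran and "reader closed" was printed,
        // even though Rows() itself never runs to completion.
    }
}
```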

    With regard to overhead, this would not be my primary concern, because these buffers ultimately have a fixed size: the server will not buffer all rows of a result set regardless of how many there are (that would be very inefficient in general). If you only read 10 rows, whether the query would ultimately return 1,000 or 1,000,000 rows if it ran to completion matters less for buffering than for the query plan. Nevertheless, it does add to the overhead.