Tags: c#, sql, sql-server, large-data

How to pull large SQL Server tables into C# for analysis


I'm in need of a bit of advice on how best to approach this problem. I inherited a project to build a reporting utility on top of an existing SQL Server database. The database contains a "Raw Data" table where every production data point is dumped. The report needs to provide the average and Cpk of several of the data columns.

The "Raw Data" table contains 25 columns, including an ID column, a LotID column, and a TimeStamp column, as well as 5 columns containing measurement data. Overall the table appears to have 20+ million records, and is essentially one huge flat file.

What I need to do is provide two ways to search: by date range, and by lot ID. One lot ID can hold as many as 2M records.

I started by developing a C# application that uses a simple SELECT query:

SELECT * 
FROM tblRawData 
WHERE [LotID] = 'XXXX'

It then uses SqlDataAdapter.Fill to populate a DataTable. I also tried SqlDataReader, looping over the results to fill the DataTable.

The biggest problem I'm seeing is 90%+ memory usage (the bulk of it in the SQL Server process), with occasional low-memory warnings, and each query takes several minutes to run.

Not being a SQL Server guru, I'm looking for advice on whether this approach is reasonable, or whether I should be doing something different.


Solution

  • 20 million records is not typically considered "a lot of data" unless your server is slow or you have text/blob/(n)varchar(max) data types in your dataset, which you should avoid if possible. To clarify: varchar(8000)/nvarchar(4000) or smaller data types are OK, because they are not treated as blob-style storage (which performs much more slowly).

    There are several ways to optimize your approach:

    1. DO NOT "SELECT *". Only pull back the fields you need; that reduces the time spent "over the wire" moving data from the SQL server into your C# application. (The sketch after this list shows the idea.)
    2. Do the processing on the SQL server. The performance of a SQL server tends to be high, although not always as high as C#. If your application only needs the answer, use the built-in AVG() function for the averages. And while I've never computed Cpk there, it is derived from the mean and standard deviation, so AVG() and STDEV() should get you most of the way (again, see the sketch after this list). You can also handle date ranges with the BETWEEN keyword.
    3. Use reasonable INDEXing. Unfortunately, proper indexing is almost an art. Essentially, use as few indexes as possible: one clustered primary-key index, plus targeted non-clustered indexes for the important lookups and aggregations. Indexes slow down INSERT, UPDATE, and DELETE operations while (sometimes) speeding up SELECTs. In your case, you probably want an index on LotID, or a composite index on LotID and the timestamp/date field; the comments in the sketch below show one example.
    4. Chunk your data. Only pull a reasonable number of rows at a time, if that's feasible. It isn't always, but keep it open as an option (see the paging sketch after the summary below). You can chunk data in loops, or stage it in a table variable (denoted @tableName) or a temp table (denoted #tableName); despite the folklore, both live in tempdb rather than purely in memory, and each has advantages and disadvantages. Temp tables would probably fit your problem better, since they cope better with large row counts.
    5. If you are using a newer version of SQL Server Management Studio, execution-plan analysis is built in. Other major tools often have this feature as well. It can tell you where all of your query's time is being spent, and will often suggest an INDEX to add.
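
    To make items 1-3 concrete, here is a minimal sketch of pushing the aggregation to the server. The table name and the LotID/TimeStamp columns come from your description; the Meas1 column, the connection string, and the spec limits (USL/LSL) are placeholders to replace with your real schema and product configuration, and I'm assuming the measurement columns are float. Cpk is min(USL - mean, mean - LSL) / (3 * stdev), so SQL Server's built-in AVG() and STDEV() provide everything it needs:

    using System;
    using System.Data.SqlClient;

    class LotReport
    {
        // Cpk from the mean and standard deviation. LSL/USL are the spec
        // limits, which must come from your product configuration.
        static double Cpk(double mean, double stdev, double lsl, double usl)
            => Math.Min(usl - mean, mean - lsl) / (3.0 * stdev);

        static void Main()
        {
            // Item 1: name only the columns you need (here, one measurement).
            // Item 2: AVG()/STDEV() run on the server, so only three values
            //         ever cross the wire.
            // Item 3: a supporting index such as
            //             CREATE NONCLUSTERED INDEX IX_tblRawData_LotID
            //                 ON tblRawData (LotID) INCLUDE (Meas1);
            //         lets SQL Server answer this without touching the other
            //         20+ columns.
            const string sql = @"
                SELECT AVG(Meas1), STDEV(Meas1), COUNT_BIG(*)
                FROM tblRawData
                WHERE LotID = @lotId";
            // For the date-range search, swap in:
            //     WHERE [TimeStamp] BETWEEN @from AND @to

            using (var conn = new SqlConnection("...your connection string..."))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@lotId", "XXXX");
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    if (reader.Read() && !reader.IsDBNull(0))
                    {
                        double mean  = reader.GetDouble(0);
                        double stdev = reader.GetDouble(1);
                        long   count = reader.GetInt64(2);
                        Console.WriteLine(
                            $"n={count}  mean={mean:F4}  Cpk={Cpk(mean, stdev, 0.0, 10.0):F3}");
                    }
                }
            }
        }
    }

    Run against a 2M-record lot, this returns three numbers instead of 2M x 25 fields, which is the whole point of items 1 and 2.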

    So, if you MUST pull a large amount of data into C#, you want to filter on indexed fields and pull back the smallest dataset possible.
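
    If you do need the raw rows in C# (item 4 above), keyset paging keeps each chunk small and each fetch cheap. This is a sketch under the assumptions that ID is the bigint clustered primary key of tblRawData and that Meas1 is float; the batch size and connection string are placeholders (usings as in the sketch above):

    // Keyset ("seek") paging: each batch resumes where the previous one
    // ended, which stays fast even deep into a 2M-row lot. (OFFSET-style
    // paging re-reads all of the skipped rows on every page.)
    var connectionString = "...your connection string...";
    const int batchSize = 50000;
    long lastId = 0;
    bool more = true;

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        while (more)
        {
            more = false;
            using (var cmd = new SqlCommand(@"
                SELECT TOP (@batch) ID, Meas1
                FROM tblRawData
                WHERE LotID = @lotId AND ID > @lastId
                ORDER BY ID", conn))
            {
                cmd.Parameters.AddWithValue("@batch", batchSize);
                cmd.Parameters.AddWithValue("@lotId", "XXXX");
                cmd.Parameters.AddWithValue("@lastId", lastId);

                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        lastId = reader.GetInt64(0);        // assumes ID is bigint
                        double value = reader.GetDouble(1); // assumes Meas1 is float
                        // Fold value into running totals here rather than
                        // buffering every row in a DataTable.
                        more = true;
                    }
                }
            }
        }
    }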

    In my experience, all of the usual ways of pulling data into C# are fast, including SqlDataAdapter, SqlDataReader, and even Entity Framework's ORM. However, if the dataset you pull back is enormous, you will certainly run out of memory on smaller boxes, and you will have to wait while all of that data moves off the disk; at that point disk speed becomes a significant bottleneck, on top of any network latency. If you have access to the SQL Server box, Resource Monitor will show this happening in real time.
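
    Along those lines, if the raw rows really must come across the wire, streaming them through a SqlDataReader and folding each value into running statistics keeps client memory flat no matter how many rows a lot contains, because no DataTable ever holds the full result set. This sketch uses Welford's online mean/variance algorithm; the column name and connection string are again placeholders (usings as in the first sketch):

    long n = 0;
    double mean = 0.0, m2 = 0.0;

    using (var conn = new SqlConnection("...your connection string..."))
    using (var cmd = new SqlCommand(
        "SELECT Meas1 FROM tblRawData WHERE LotID = @lotId", conn))
    {
        cmd.Parameters.AddWithValue("@lotId", "XXXX");
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                double x = reader.GetDouble(0);   // assumes Meas1 is float
                n++;
                double delta = x - mean;
                mean += delta / n;                // running mean
                m2 += delta * (x - mean);         // running sum of squared deviations
            }
        }
    }

    double stdev = n > 1 ? Math.Sqrt(m2 / (n - 1)) : 0.0;
    // mean and stdev feed the same Cpk formula as before.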