I'm in need of a bit of advice on how best to approach this problem. I inherited a project to build a reporting utility on top of an existing SQL Server database. The database contains a "Raw Data" table into which every production data point is dumped. The report needs to provide the average and Cpk of several of the data columns.
The "Raw Data" table contains 25 columns, including an ID
column, a LotID
column, and a TimeStamp
column, as well as 5 columns containing measurement data. Overall the table appears to have 20+ million records, and is essentially one huge flat file.
What I need to do is provide two ways to search: by date range, and by lot ID. A single lot ID can hold as many as 2 million records.
I started by developing a C# application that uses a simple SELECT query:
SELECT *
FROM tblRawData
WHERE [LotID] = 'XXXX'
It then uses the SqlDataAdapter.Fill method to populate a DataTable. I also tried a SqlDataReader, looping over the results to populate the DataTable.
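In rough outline, the code does something like this (connection string and lot ID are placeholders):

using System.Data;
using System.Data.SqlClient;

string connectionString = "...";  // placeholder
string lotId = "XXXX";            // placeholder

// Pull the entire lot into a DataTable.
var table = new DataTable();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT * FROM tblRawData WHERE [LotID] = @lotId", conn))
{
    cmd.Parameters.AddWithValue("@lotId", lotId);
    using (var adapter = new SqlDataAdapter(cmd))
    {
        adapter.Fill(table); // materializes every row in memory at once
    }
}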
The biggest problem I'm seeing is 90%+ memory usage (the bulk of it in the SQL Server process), the occasional low-memory warning, and queries that each take several minutes to run.
Not being a SQL Server guru, I'm looking for advice on whether this approach seems reasonable, or whether I should be doing something different.
20 million records is not typically considered "a lot of data," unless your server is slow or your dataset contains text/blob/(n)varchar(max) data types, which you should avoid if possible. To clarify: varchar(8000)/nvarchar(4000) or smaller data types are fine, because they are not treated as blob-style storage (which performs much more slowly).
There are several ways to optimize your approach. If you MUST pull a large amount of data into C#, you want to SELECT only on indexed fields, and only pull back the smallest dataset possible.
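For example, since your report only needs the average and Cpk, you can have SQL Server compute the aggregates and return a single row per lot instead of shipping up to 2 million raw rows to the client. This is just a sketch: the column name Measure1 and its float type are assumptions based on your description, and Cpk additionally needs your spec limits (USL/LSL), which get applied to the returned mean and standard deviation.

using System;
using System.Data.SqlClient;

static class LotStats
{
    // Computes the mean and Cpk of one measurement column for one lot.
    // The aggregation runs entirely on the server; only one row comes back.
    public static (double Mean, double Cpk) ForLot(
        string connectionString, string lotId, double usl, double lsl)
    {
        const string sql = @"
            SELECT AVG(Measure1)   AS Mean1,
                   STDEV(Measure1) AS StDev1
            FROM tblRawData
            WHERE [LotID] = @lotId;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@lotId", lotId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                reader.Read();
                double mean = reader.GetDouble(0); // AVG/STDEV of a float column come back as float
                double sdev = reader.GetDouble(1);
                // Cpk = min(USL - mean, mean - LSL) / (3 * sigma)
                double cpk = Math.Min(usl - mean, mean - lsl) / (3 * sdev);
                return (mean, cpk);
            }
        }
    }
}

With an index on LotID, a query like this never has to move the raw rows over the wire at all.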
In my experience, all of the usual ways of pulling data into C# are fast, including SqlDataAdapter, SqlDataReader, and even Entity Framework's ORM. However, if the dataset you are pulling back is enormous, you will certainly run out of memory on smaller boxes, and you will have to wait for all of that data to be moved off the disk, where disk speed becomes a significant performance bottleneck on top of any network latency. If you have access to the SQL server box's Resource Monitor, you can watch this happen in real time.
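If you really do need to touch every row client-side, you can at least keep memory flat by streaming through a SqlDataReader and maintaining running aggregates instead of filling a DataTable. Here is a sketch (again assuming a float Measure1 column) using Welford's algorithm for a single-pass mean and standard deviation:

using System;
using System.Data.SqlClient;

static class StreamingStats
{
    // Streams the lot one row at a time and keeps running aggregates,
    // so memory use stays constant no matter how many rows the lot has.
    public static (long Count, double Mean, double StDev) ForLot(
        string connectionString, string lotId)
    {
        long n = 0;
        double mean = 0.0, m2 = 0.0;

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Measure1 FROM tblRawData WHERE [LotID] = @lotId", conn))
        {
            cmd.Parameters.AddWithValue("@lotId", lotId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read()) // one row in memory at a time
                {
                    if (reader.IsDBNull(0)) continue;
                    double x = reader.GetDouble(0);
                    n++;
                    double delta = x - mean;    // Welford's update
                    mean += delta / n;
                    m2   += delta * (x - mean);
                }
            }
        }

        double stdev = n > 1 ? Math.Sqrt(m2 / (n - 1)) : 0.0;
        return (n, mean, stdev);
    }
}

You are still moving 2 million rows, so the disk and network costs described above still apply, but the client no longer holds them all at once.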