Search code examples
sql-serverdatabasesql-server-2012normalization

Should I normalize a 1.3 million record flat file for analysis?


I've been handed an immense flat file of health insurance claims data. It contains 1.3 million rows and 154 columns. I need to do a bunch of different analyses on these data. This will be in SQL Server 2012.

The file has 25 columns for diagnosis codes (DIAG_CD01 through DIAG_CD_25), 8 for billing codes (ICD_CD1 through ICD_CD8), and 4 for procedure modifier codes (MODR_CD1 through MODR_CD4). It looks like it was dumped from a relational database. The billing and diagnosis codes are going to be the basis for much of the analysis.

So my question is whether I should split the file into a mock relational database. Writing analysis queries on a table like this will be a nightmare. If I split it into a parent table and three child tables (Diagnoses, Modifiers, and Bill_codes) my query code will much easier. But if I do that I'll have, on top of the 1.3 million parent records, up to 32.5 million diagnosis records, up to 10.4 million billing code records, and up to 5.2 million modifier records. On the other hand, a huge portion of the flat data of the three sets is null fields, which are supposed to screw up query performance.

What are the likely performance consequences of querying these data as a mock relational database vs. as the giant flat file? Reading about normalization it sounds like performance should be better, but the sheer number of records in a four table split gives me pause.


Solution

  • Seems like if you keep it denormalized you would have to repeat query logic a whole bunch of times (25 for Diagnoses), and even worse, you have to somehow aggregate all those pieces together.

    Do like you suggested and split the data into logical tables like Diagnosis Codes, Billing Codes, etc. and your queries will be much easier to handle.

    If you have a decent machine these row counts should not be a performance problem for sql server. Just make sure you have indexes to help with your joins, etc.

    Good luck!