Search code examples
databasedatabase-designdatabase-normalizationauditdenormalization

Is it possible to store historical configuration settings for each row of data without cramming all the configuration settings into each row of data?


For background: I was recently hired as a database engineer for a water treatment company. We deploy water treatment machines to sites across the country, and the machines treat water and send continuous data back to us regarding the state of incoming water (flow rate, temperature, concentration of X in incoming water, etc.), and regarding the treatments the machine applied to that water at that point in time. Over time, sites (and their various components) change a lot: a machine might break down and need to be replaced, a different concentration of chemical may be used to fill the machine's tanks, its flow meters and other sensors might be recalibrated or set to scale differently, its chemical pumps might be replaced, and on and on. These affect the interpretation of the data: for example, if 5 mL of chlorine was added to the incoming water at 01/01/2021 12:00:05, that means two completely different things if the chlorine was 5% concentrated or 40% concentrated.

Water treatment datapoints are identified by a composite key consisting of the ID of the site, and a timestamp. It would be easy if the only data that mattered was current data, as I could just store the configuration settings on a Site level and pull them up for datapoints as needed. But we need to be able to correctly interpret older data. So, I thought about storing configurations in another table, trackingall the settings for each site over each time period, but it's not possible to create a foreign key between the continuous timestamps of the datapoints and the start/end dates of the configurations - the closest thing would be some kind of range check, like "Datapoint.TimeStamp BETWEEN Configuration.Start AND Configuration.End". So the only other option I see is to store every configuration setting for every datapoint alongside each datapoint, but that seems like a terrible solution given how many configuration settings there are and how many datapoints are generated, especially since most of the settings don't even change often.

So, is there a way to store historical configurations for each row of data in a way that is at all normalized, or is the only possible solution to cram all the settings into each datapoint?


Solution

  • The original answer that I accepted seems to have been deleted. For anyone coming here in the future, the solution that I intend to go with is as follows:

    I'm going to create a configuration table to hold settings in the following format:

    _SiteID_  _Start_                _End_                  <various settings fields>
    318       "2021-01-01 12:22:03"  "2021-02-10 09:08:26"  ...
    

    Where the primary key is (SiteID, Start, End). SiteID is a foreign key to the integer ID of the Site table, Start is the date at which the configuration starts being valid, and End (default: NULL) is the date at which the configuration is no longer valid. In order to keep things good and simple for users (and myself), and to prevent any accidental updates to old configuration settings when instead there should have been a new configuration row inserted, I'm going to disallow UPDATE and DELETE operations on the configuration table for all users except root, and instead create a stored procedure for "updating" the configuration of a given Site. The stored procedure will take whatever new parameters the user specified, copy in any parameters that the user DIDN'T specify from the most recent configuration for that Site (i.e., the row with the same SiteID and the NULL End date), overwrite the most recent configuration row's NULL End date to be the Start date for the new row, and finally create the new row with the specified Start date.

    NOTE: the Start date and End date are both stored for each configuration because configurations might not necessarily be continuous, i.e. it is not the case that "as soon as a configuration expired, there is another configuration that starts at the exact time that that configuration expired", as deployments of water treatment equipment sometimes have large gaps in between them if a client doesn't need our services for some period of time. Without storing the End dates for configurations too, we would have to assume that each configuration lasts until the next configuration begins, or until now, if there is no later configuration stored. So End date is stored so that we don't ever think "Site A was configured to have X Y Z settings from January 2020 to June 2021" when there hasn't even been a machine at Site A since May of 2020. Storing the End date explicitly alongside the Start date also avoids the ickiness of needing to rely on the values in other rows of configuration data to know how to interpret a given row of configuration data.

    Thank you to whoever it was who originally gave me the inspiration for this answer, I have no idea why your answer was deleted.