Working on a piece of software for thermodynamic simulation (in C++), I need to store fluid properties at different temperatures. These data points are used to build a simple interpolation function, so that property values can also be evaluated at temperatures for which we don't have full experimental data.
A fluid is identified simply by its name and concentration (%), the latter often irrelevant. If you're curious, the four properties of interest are: mass density, dynamic viscosity, specific heat capacity and thermal conductivity. For all intents and purposes, though, these are just 4 numeric properties; call them A, B, C and D.
So the function for a specific fluid looks something like this: (A, B, C, D) = f(T), where T is the temperature.
Currently the data is stored in a SQLite3 database, and the fluids table looks like this:
+----+-------+---------------+
| id | name  | concentration |
+====+=======+===============+
| 1  | Water | 100           |
+----+-------+---------------+
| 2  | ..... | ...           |
+----+-------+---------------+
And then there is the properties table:
+----------+-------------+---------+-----------+--------------+----------+
| fluid_id | temperature | density | viscosity | conductivity | capacity |
+==========+=============+=========+===========+==============+==========+
| 2        | 373.15      | 1045.48 | 0.412     | 1.415        | 0.845    |
| 3        | 273.15      | 1105.0  | 2.113     | 0.4688       | 0.849    |
| 3        | 283.15      |         | 1.678     | 0.4859       | 0.8503   |
| 3        | 293.15      | 1098.0  | 1.353     | 0.5015       | 0.5833   |
| 3        | 303.15      |         | 1.08      | 0.5164       |          |
| 3        | 313.15      | 1090.0  | 0.893     | 0.532        | 0.8561   |
| 3        | 323.15      |         | 0.748     | 0.5432       |          |
| 3        | 333.15      | 1080.0  | 0.644     | 0.5543       | 0.8577   |
| 3        | 343.15      |         | 0.563     | 0.564        |          |
| 3        | 353.15      | 1068.0  | 0.499     | 0.5722       | 0.8612   |
| 3        | 363.15      |         | 0.44      | 0.5796       |          |
| 3        | 373.15      | 1054.0  | 0.39      | 0.5856       |          |
+----------+-------------+---------+-----------+--------------+----------+
Inserting data manually for testing was OK. This would also be an intuitive display for a Fluid Editor GUI later on.
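In DDL terms the schema is roughly the following (the names match the tables above; the types and constraints are simplified for illustration):

    CREATE TABLE fluids (
        id            INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        concentration REAL            -- often irrelevant, hence NULLable
    );

    CREATE TABLE properties (
        fluid_id     INTEGER NOT NULL REFERENCES fluids(id),
        temperature  REAL NOT NULL,   -- kelvin
        density      REAL,            -- NULL = no data point at this temperature
        viscosity    REAL,
        conductivity REAL,
        capacity     REAL,
        PRIMARY KEY (fluid_id, temperature)
    );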
In code, however, the interpolation is done separately for each property. Moreover, since the NULL values are unusable, not every temperature (row) is relevant for every property. To adapt things to the code's point of view, I created four identical views, one for each property. For example:
+----+-----------+---------------+-------------+-------+
| id | name      | concentration | temperature | value |
+====+===========+===============+=============+=======+
| 2  | Sea Water | 22            | 373.15      | 0.412 |
| 3  | Sea Water | 14            | 273.15      | 2.113 |
| 3  | Sea Water | 14            | 283.15      | 1.678 |
| 3  | Sea Water | 14            | 293.15      | 1.353 |
| 3  | Sea Water | 14            | 303.15      | 1.08  |
| 3  | Sea Water | 14            | 313.15      | 0.893 |
| 3  | Sea Water | 14            | 323.15      | 0.748 |
| 3  | Sea Water | 14            | 333.15      | 0.644 |
| 3  | Sea Water | 14            | 343.15      | 0.563 |
| 3  | Sea Water | 14            | 353.15      | 0.499 |
| 3  | Sea Water | 14            | 363.15      | 0.44  |
| 3  | Sea Water | 14            | 373.15      | 0.39  |
+----+-----------+---------------+-------------+-------+
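A sketch of one such view definition, here for viscosity (the other three differ only in the property column; the view name is just for illustration):

    CREATE VIEW viscosity_view AS
    SELECT f.id, f.name, f.concentration, p.temperature,
           p.viscosity AS value
    FROM fluids f
    JOIN properties p ON p.fluid_id = f.id
    WHERE p.viscosity IS NOT NULL;   -- only rows relevant to this property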
Now, as I'm gradually moving over from prototyping to building the proper software, I'm trying to think about how either of these approaches would fit into an ORM perspective. Should there be a model for each property (like my views), or a single model for all properties (like the table currently in use)? A third alternative might be to leave the database as it is and build the models on top of the views (instead of real tables), but this isn't the way of ORMs.
I even considered moving this dataset to a NoSQL solution (e.g. MongoDB), but I couldn't think of a way to overcome the double-perspective issue.
I admit that there is neither a runtime nor a space performance issue here, and that the volume of data to be stored and processed is negligible anyway. There might be as few as two queries an hour, each loading the dataset for a specific fluid into the application's memory and working with it there (interpolation, and calculations based on the evaluations). So I'd accept it if you think I'm over-stressing about this.
Otherwise, I'd like to hear your thoughts and to consider any different approach you might offer. Am I missing something? What about the redundancy of keys (fluid and temperature) that splitting the table would incur? Also, this might be of interest to others who do face such constraints.
Every table that uses NULL to indicate a "missing" ("unknown" or "inapplicable") value corresponds to a schema that drops the NULLable column and introduces another table, keyed by some (super)key of the original, holding only the rows that are not NULL in that column.
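For your schema, that would mean a NULL-free table per property, something like this (names for illustration):

    CREATE TABLE fluid_density (
        fluid_id    INTEGER NOT NULL REFERENCES fluids(id),
        temperature REAL NOT NULL,
        density     REAL NOT NULL,  -- no NULLs: a missing measurement is an absent row
        PRIMARY KEY (fluid_id, temperature)
    );

with analogous fluid_viscosity, fluid_conductivity and fluid_capacity tables.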
NULLs do make it easier for a human to read these separate same-(super)key tables simultaneously. But they complicate everything else, including the very meaning of a table (the criterion/predicate by which a row goes in or stays out), and hence both base design and query composition. Usually we remove any NULLs as soon as possible, other than those pasted to a key in order to be shown alongside it in the final output. (E.g. OUTER JOIN is often used idiomatically to introduce and then remove NULLs, expressing a NULL-free EXCEPT/MINUS.) We use the separate tables' meanings both in expressing the meaning of the table with NULLs and in the meaning of queries using it. We extract the separate tables from the combined NULLed view for basically any processing.
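For instance, under the per-property tables just sketched, "temperatures at which fluid 3 has a viscosity value but no density value" is a NULL-free EXCEPT, and the OUTER JOIN idiom expresses the same thing by introducing and then removing NULLs:

    -- NULL-free version
    SELECT temperature FROM fluid_viscosity WHERE fluid_id = 3
    EXCEPT
    SELECT temperature FROM fluid_density WHERE fluid_id = 3;

    -- idiomatic OUTER JOIN version: NULLs appear, then are filtered out
    SELECT v.temperature
    FROM fluid_viscosity v
    LEFT JOIN fluid_density d
           ON d.fluid_id = v.fluid_id AND d.temperature = v.temperature
    WHERE v.fluid_id = 3 AND d.fluid_id IS NULL;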
The straightforward relational solution is the separate tables. The SQL tradition is the single table. Another SQL justification is the reduced number of JOINs that come from combining the tables. But that pays off only for the final human-readable output! To process the one table you have to extract the separate views anyway.
PS
Re "the redundancy of keys (fluid and temperature) that splitting the table would incur": There is no "redundancy" in having the same (subrow) value(s) appear more than once in a column, table or database. Redundancy occurs when two base tables or rows make the same assertion. A row substituted into a criterion/predicate gives a statement/proposition; a present row asserts its and an absent row asserts NOT its; a table (proposition) asserts the conjunction of its present and absent rows' assertions. (And a database (proposition) asserts the conjunction of its tables' propositions.) Moreover redundancy isn't necessarily bad; there is always an engineering tradeoff among time, space and complexity.
PPS
ORMs have views: deferred or lazy query evaluation. Since you can later use such a query within another, it acts just like a view. It is a named representation of a query in the ORM query language rather than in the DBMS query language. This can conceivably serve for update commands (when possible) as well as for querying; that is up to the DBMS/ORM. However, view updating is basically a convenience, since it can always be expressed in terms of updates to the constituent tables.
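On the DBMS side this is plain query nesting; e.g. reusing the viscosity view sketched in the question inside another query:

    -- one named query used inside another, just like an ORM's lazy query object
    SELECT name, concentration, COUNT(*) AS data_points
    FROM viscosity_view
    GROUP BY name, concentration;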