I'm trying to make a database that relates the production and use of a product. The production involves 3 different steps and each steps gives it a batch number. I want to be able to look at the final product and write a query for finding the enz_name (step 0) from steps 4 and above (the steps in which the product is used).
The blue in this picture how something with the same enz_name, and batch numbers for the firs two steps can be treated as two different end products because it gets split up as 2 different batches at the last production step.
My objective: to be able to look at data from steps 4 and above (data from using the final product) and be able to track down the enz_name of the relevant final products.
My problem: the way i have it set up now is I would have to join tables from step 4 to step 3, step 3 to step 2, step 2 to step 1, and finally step 1 to the construct (step 0). That many steps seems inconvenient. And i'd basically be asking Oracle to search through over half of my tables to get an answer. Is that bad practice and bad design?
Possible solutions i'm thinking of:
I'm thinking i could make an intermediary table with enz_name and construct_id. But that doesn't really solve the problem of looking at the CELL tables and wanting to know the enz_name because I's still have to go through CHAR_ENZ, PURIFIED_ENZ, PRODUCED, CONSTRUCT to get to the intermediary table. Should I make a second intermediary table then that contains construct_id, g_batch, p_batch, and char_id? Is that really stupid?
I could just add a FK construct_id column to the char_enz table (step 3 of production) that references the construct table (step 0). This seems like the most obvious solution, but is there a better way? Is this a good solution?
I could just add a FK construct_id column to the char_enz table
That is called "denormalizing", as you may know. It comes with a few problems.
You are storing multiple copies of the same information. That is, I can look in CHAR_ENZ
to find the enzyme name or do the 6 table join. The extra storage is wasteful (but really, who cares?) but also now opens up the possibility of inconsistencies. Like the old saying goes: "a man with a watch knows what time it is; a man with two is never really sure".
It complicates updates. If you ever wanted to update the enzyme name on a batch, instead of just updating, say PRODUCED.CONSTRUCT_ID
, now you also need to remember to update all the enzyme names in CHAR_ENZ
as well.
Denormalization is not necessarily the devil, but you should only use it to avoid worse problems. A 6 table join is probably not a "worse" problem bad enough to justify it (obviously in my opinion, with few details).
In fact, personally, I would probably just go with the 6 table join. You could consider placing the PRODUCED
, PURIFIED_ENZ
, and CHAR_ENZ
tables in a cluster during your physical database design. That should minimize performance impacts. You can also make a view to encapsulate the join logic.
If you want to denormalize, you can choose not to denormalize "all the way". That is, for example, you can use composite keys to sort of denormalize but still benefit from data integrity constraints. E.g.,
PURIFIED_ENZ
==> primary key (G_BATCH, P_BATCH)
CHAR_ENZ
==> primary key (G_BATCH, P_BATCH, CHAR_ID)
This is also a violation of third normal form, if you want to call the police. But it makes it impossible for there to be a mismatch between a CHAR_ENZ
row and the "denormalized" G_BATCH
column. You can use the G_BATCH
to cut down the size of the join needed to get the enzyme name.
One last thing that concerns me about your data model is that it seems to be assuming and requiring that steps 1-3 of production will always be the same. Technologies change; customer requirements change. I know very little about microbiology, but as a rule, I would want to build a data model with fewer assumptions. You could do this by abstracting the batch steps a bit.