I'm trying to redesign a low-moderate complexity database schema. I'm trying to decide how many foreign keys to include.
Is it generally better to have a great many foreign keys densely connecting entities/tables to each other, or is it preferable to have as few foreign keys connecting entities/tables together as possible? Where do you draw the line?
As noted in https://stackoverflow.com/a/1876062/12713117 "Foreign keys have less to do with joins than with keeping database integrity. Proof of that is that you can join tables in any way you want, even in ways that don't necessarily make sense."
In my case, there are a great many relationships between some but not all tables in my database schema. The graph resembles a dense web as opposed to an orderly hub-and-spoke or a similarly recognizable graph.
Many of the foreign key columns set up with Entity Framework are optional because data is imported piecemeal and strict requirements wouldn't allow me to perform each piece of the import process. The data eventually becomes consistent after all pieces of the import process are finished. This gave my problem an obvious solution: I set up foreign keys whenever I could without disrupting the import process.
Unfortunately, this lead to some tables being highly inter-connected and creating a code smell.
As I built the system, the number of table joins to complete my main use case - "get all data related needed to create an analysis" - has grown. I now have a large PostgreSQL View which encompasses many tables and is going to get bigger as I improve the system. I don't like having to maintain this extremely large query. Some pieces of data require only a single join, but others require joins on joins on joins. That seem like code smell for a system with this few tables.
This is a pharmacy benefit related system. There is standards-based data collected and mixed with internal data before being presented to a staff member in a large SQL query. The staff member analyzes this data and creates a Suggestion.
From a CRUD perspective, the system is simple. I see no reason why I can't have as many foreign keys as I want from a performance perspective.
From a Query perspective, the only difficult query is gathering all information to create an analysis. This is requiring several table joins as previously mentioned.
Discussed at end of question in "Update from comments".
Segment the data by source and write services and queries to pull data from those sources separately.
The problem is although the data is coming from different sources the data itself is inter-related across sources and there is a direct foreign key relationship between data from those sources.
This doesn't solve the code smell. All data is in the same domain.
I tried to divide the data where I put one table containing primarily foreign keys in the middle (a hub), and referenced other tables containing primarily foreign keys (a hub), and attached these tables to the hubs and the hubs to each other. This was done to reduce the number of foreign keys in each table.
This didn't work because relationships between entities in different hubs were stronger than the relationships between entities within the same hubs. Claim-Prescription-Pharmacy-Prescriber-Plan-PlanMember-Formulary-FormularyRow-Analysis-Suggestion are all inherently related so I would need a single hub and put all of these entities as spokes on that hub. That's dumb.
I have a messy graph of entity relationships. I think this is producing code smell in queries, but perhaps I should just accept it?
A commenter asked for information about relationships between entities.
Entity | Relationship | Other Entity |
---|---|---|
Claim | many : 1 | Prescription |
Claim | many : 1 | Pharmacy |
Claim | many : 1 | Prescriber |
Claim | many : 1 | Plan |
Claim | many : 1 | Plan Member |
Claim | many : 1 | Formulary |
Claim | many : 1 | Formulary Row |
Claim | many : many | Analysis |
Claim | many : many | Suggestion |
Prescription | many : 1 | Prescriber |
Prescription | many : 1 | Plan Member |
Prescription | many : 1 | Formulary Row |
Prescription | many : many | Analysis |
Prescription | many : many | Suggestion |
Pharmacy | many : many | Plan |
Pharmacy | many : many | Analysis |
Pharmacy | many : many | Suggestion |
Prescriber | many : many | Analysis |
Prescriber | 1 : many | Suggestion |
Plan | 1 : many | Plan Member |
Plan | 1 : 1 | Formulary |
Plan | 1 : many | Analysis |
Plan | 1 : many | Suggestion |
Plan Member | many : 1 | Formulary |
Plan Member | 1 : many | Analysis |
Plan Member | 1 : many | Suggestion |
Formulary | 1 : many | Formulary Row |
Formulary | 1 : many | Analysis |
Formulary | 1 : many | Suggestion |
Formulary Row | many : many | Analysis |
Formulary Row | many : many | Suggestion |
Analysis | 1 : many | Suggestion |
"Joins on joins on joins" is not a code smell, it is normalization. While de-normalizing a schema into a "web" with FKs duplicated across tables that may benefit from a relationship may simplify the query you read, it introduces much more serious problems in introducing multiple sources of truth that can become inconsistent.
Take a relationship between Claim, Prescription, and Prescriber.
Option 1: Normalized Prescription has a ClaimId and a PrescriberId. If I am loading a Claim and I want the Prescriber, I go through the Prescription. (Claim.Prescription.Prescriber)
Option 2: Denormalized Prescription has a ClaimId and PrescriberId, and Prescriber has a ClaimId. If I am loading a Claim and I want the Prescriber, I can shortcut to join directly to the Prescriber.
The problem now is that I have two routes between Claim and Prescriber. I can go Claim.Prescriber or Claim.Prescription.Prescriber. What ensures that the Prescriber.ClaimId will always match the Prescription.ClaimId for Prescriptions that join to this Prescriber?
I honestly think that you are going way, way overboard in overthinking the design of this relatively simple system. Database engines are designed to index and resolve data extremely fast and efficiently. It's not something you need to worry about optimizing for unless you're building a system big enough to need to worry about it.
As far as writing queries goes, I would highly recommend learning about navigation properties, and then leverage EF to turn Linq expressions into flattened models suited to what you want to display or work with using Select
. EF/Linq is not a C# substitute for writing SQL in that you don't need to manually "Join" entities together. Navigation properties allow EF to work out the joins entirely in the background. Join
is provided for those edge cases where there is a relationship between entities that cannot be expressed with a FK. (I.e. OwnerType + OwnerId for anti-pattern tables that can reference 2 or more other tables via a single ID column)