database postgresql design-patterns database-design entity-framework-core

Foreign key usage in low-moderate complexity database

I'm trying to redesign a low-moderate complexity database schema. I'm trying to decide how many foreign keys to include.

Is it generally better to have a great many foreign keys densely connecting entities/tables to each other, or is it preferable to have as few foreign keys connecting entities/tables together as possible? Where do you draw the line?

As noted in https://stackoverflow.com/a/1876062/12713117 "Foreign keys have less to do with joins than with keeping database integrity. Proof of that is that you can join tables in any way you want, even in ways that don't necessarily make sense."

In my case, there are a great many relationships between some but not all tables in my database schema. The graph resembles a dense web as opposed to an orderly hub-and-spoke or a similarly recognizable graph.

Many of the foreign key columns set up with Entity Framework are optional because data is imported piecemeal and strict requirements wouldn't allow me to perform each piece of the import process. The data eventually becomes consistent after all pieces of the import process are finished. This gave my problem an obvious solution: I set up foreign keys whenever I could without disrupting the import process.

Unfortunately, this lead to some tables being highly inter-connected and creating a code smell.

Code smell problem

As I built the system, the number of table joins to complete my main use case - "get all data related needed to create an analysis" - has grown. I now have a large PostgreSQL View which encompasses many tables and is going to get bigger as I improve the system. I don't like having to maintain this extremely large query. Some pieces of data require only a single join, but others require joins on joins on joins. That seem like code smell for a system with this few tables.

The system

This is a pharmacy benefit related system. There is standards-based data collected and mixed with internal data before being presented to a staff member in a large SQL query. The staff member analyzes this data and creates a Suggestion.

From a CRUD perspective, the system is simple. I see no reason why I can't have as many foreign keys as I want from a performance perspective.

From a Query perspective, the only difficult query is gathering all information to create an analysis. This is requiring several table joins as previously mentioned.

Tables

Discussed at end of question in "Update from comments".

Potential Solutions

Input-driven Approach

Segment the data by source and write services and queries to pull data from those sources separately.

The problem is although the data is coming from different sources the data itself is inter-related across sources and there is a direct foreign key relationship between data from those sources.

Domain driven approach

This doesn't solve the code smell. All data is in the same domain.

Hub and Spoke

I tried to divide the data where I put one table containing primarily foreign keys in the middle (a hub), and referenced other tables containing primarily foreign keys (a hub), and attached these tables to the hubs and the hubs to each other. This was done to reduce the number of foreign keys in each table.

This didn't work because relationships between entities in different hubs were stronger than the relationships between entities within the same hubs. Claim-Prescription-Pharmacy-Prescriber-Plan-PlanMember-Formulary-FormularyRow-Analysis-Suggestion are all inherently related so I would need a single hub and put all of these entities as spokes on that hub. That's dumb.

Messy graph

I have a messy graph of entity relationships. I think this is producing code smell in queries, but perhaps I should just accept it?

Update from comments

A commenter asked for information about relationships between entities.

Entity	Relationship	Other Entity
Claim	many : 1	Prescription
Claim	many : 1	Pharmacy
Claim	many : 1	Prescriber
Claim	many : 1	Plan
Claim	many : 1	Plan Member
Claim	many : 1	Formulary
Claim	many : 1	Formulary Row
Claim	many : many	Analysis
Claim	many : many	Suggestion
Prescription	many : 1	Prescriber
Prescription	many : 1	Plan Member
Prescription	many : 1	Formulary Row
Prescription	many : many	Analysis
Prescription	many : many	Suggestion
Pharmacy	many : many	Plan
Pharmacy	many : many	Analysis
Pharmacy	many : many	Suggestion
Prescriber	many : many	Analysis
Prescriber	1 : many	Suggestion
Plan	1 : many	Plan Member
Plan	1 : 1	Formulary
Plan	1 : many	Analysis
Plan	1 : many	Suggestion
Plan Member	many : 1	Formulary
Plan Member	1 : many	Analysis
Plan Member	1 : many	Suggestion
Formulary	1 : many	Formulary Row
Formulary	1 : many	Analysis
Formulary	1 : many	Suggestion
Formulary Row	many : many	Analysis
Formulary Row	many : many	Suggestion
Analysis	1 : many	Suggestion

Solution

"Joins on joins on joins" is not a code smell, it is normalization. While de-normalizing a schema into a "web" with FKs duplicated across tables that may benefit from a relationship may simplify the query you read, it introduces much more serious problems in introducing multiple sources of truth that can become inconsistent.

Take a relationship between Claim, Prescription, and Prescriber.

Option 1: Normalized Prescription has a ClaimId and a PrescriberId. If I am loading a Claim and I want the Prescriber, I go through the Prescription. (Claim.Prescription.Prescriber)

Option 2: Denormalized Prescription has a ClaimId and PrescriberId, and Prescriber has a ClaimId. If I am loading a Claim and I want the Prescriber, I can shortcut to join directly to the Prescriber.

The problem now is that I have two routes between Claim and Prescriber. I can go Claim.Prescriber or Claim.Prescription.Prescriber. What ensures that the Prescriber.ClaimId will always match the Prescription.ClaimId for Prescriptions that join to this Prescriber?

I honestly think that you are going way, way overboard in overthinking the design of this relatively simple system. Database engines are designed to index and resolve data extremely fast and efficiently. It's not something you need to worry about optimizing for unless you're building a system big enough to need to worry about it.

As far as writing queries goes, I would highly recommend learning about navigation properties, and then leverage EF to turn Linq expressions into flattened models suited to what you want to display or work with using Select. EF/Linq is not a C# substitute for writing SQL in that you don't need to manually "Join" entities together. Navigation properties allow EF to work out the joins entirely in the background. Join is provided for those edge cases where there is a relationship between entities that cannot be expressed with a FK. (I.e. OwnerType + OwnerId for anti-pattern tables that can reference 2 or more other tables via a single ID column)