Search code examples
postgresqlforeign-keysreferential-integritydata-ingestion

How can a relational database with foreign key constraints ingest data that may be in the wrong order?


The database is ingesting data from a stream, and all the rows needed to satisfy a foreign key constraint may be late or never arrive.

This can likely be accomplished by using another datastore, one without foreign key constraints, and then when all the needed data is available, read into the database which has fk constraints. However, this adds complexity and I'd like to avoid it.

We're working on a solution that creates "placeholder" rows to point the foreign key to. When the real data comes in, the placeholder is replaced with real values. Again, this adds complexity, but it's the best solution we've found so far.

How do people typically solve this problem?

Edit: Some sample data which might help explain the problem:

Let's say we have these tables:

CREATE TABLE order (
  id INTEGER NOT NULL,
  order_number,
  PRIMARY KEY (id),
  UNIQUE (order_number)
);

CREATE TABLE line_item (
  id INTEGER NOT NULL,
  order_number INTEGER REFERENCES order(order_number),
  PRIMARY KEY (id)
);

If I insert an order first, not a problem! But let's say I try:

INSERT INTO line_item (order_number) values (123) before order 123 was inserted. This will fail the fk constraint of course. But this might be the order I get the data, since it's reading from a stream that is collecting this data from multiple sources.

Also, to address @philpxy's question, I didn't really find much on this. One thing that was mentioned was deferred constraints. This is a mechanism that waits to do the fk constraints at the end of a transaction. I don't think it's possible to do that in my case however, since these insert statements will be run at random times whenever the data is received.


Solution

  • You have a business workflow problem, because line items of individual orders are coming in before the orders themselves have come in. One workaround, perhaps not ideal, would be to create a before insert trigger which checks, for every incoming insert to the line_item table, whether that order already exists in the order table. If not, then it will first insert the order record before trying the insert on line_item.

    CREATE OR REPLACE FUNCTION "public"."fn_insert_order" () RETURNS trigger AS $$
    BEGIN
        INSERT INTO "order" (order_number)
        SELECT NEW.order_number
        WHERE NOT EXISTS (SELECT 1 FROM "order" WHERE order_number = NEW.order_number);
        RETURN NEW;
    END
    $$
    LANGUAGE 'plpgsql'
    
    # trigger 
    CREATE TRIGGER "trigger_insert_order"
    BEFORE INSERT ON line_item FOR EACH ROW
    EXECUTE PROCEDURE fn_insert_order()
    

    Note: I am assuming that the id column of the order table in fact is auto increment, in which case Postgres would automatically assign a value to it when inserting as above. Most likely, this is what you want, as having two id columns which both need to be manually assigned does not make much sense.