Search code examples
sqldatabasedata-modeling

Database Design: Best approach for handling hierarchical data insertion in SQL database


I have a SQL database with hierarchical tables, including Campaign, AdGroup, and AdGroupAd, where each table references the previous one through foreign key relationships. Inserting data into these tables becomes complex as I need to ensure the correct order of insertion to maintain data integrity. I'm considering two approaches and seeking advice on the best approach:

  • Insert records in the correct order: This involves manually inserting data into each table in the hierarchical order, ensuring that the required relationships are maintained. For example, I would first insert a campaign, then an ad group referencing the campaign, and finally an ad referencing the ad group.

  • Use an entry table with a trigger: This approach involves creating an additional table called LoadingZone, acting as an entry point for the data. A trigger on the LoadingZone table would handle the insertion process by automatically inserting records into the respective tables in the correct order, based on the data provided.

I have provided small SQL examples for both scenarios above. I would appreciate insights into the pros and cons of each approach and guidance on the best approach considering factors such as data integrity, performance, and maintenance. Which approach is more suitable for handling hierarchical data insertion in a SQL database?

Note that the examples are simplified versions of the real data model.

Database:
CREATE TABLE Campaign (
  id INT PRIMARY KEY,
  campaign_name VARCHAR(50)
);

CREATE TABLE AdGroup (
  id INT PRIMARY KEY,
  ad_group_name VARCHAR(50),
  campaign_id INT,
  FOREIGN KEY (campaign_id) REFERENCES Campaign(id)
);

CREATE TABLE AdGroupAd (
  id INT PRIMARY KEY,
  ad_name VARCHAR(50),
  ad_group_id INT,
  FOREIGN KEY (ad_group_id) REFERENCES AdGroup(id)
);

Scenario 1: Insert records in the correct order:
-- Insert a campaign
INSERT INTO Campaign (id, campaign_name) VALUES (1, 'International Stuff');

-- Insert an ad group for the campaign
INSERT INTO AdGroup (id, ad_group_name, campaign_id) VALUES (1, 'Regional 1', 1);

-- Insert an ad for the ad group
INSERT INTO AdGroupAd (id, ad_name, ad_group_id) VALUES (1, 'Blue', 1);
Scenario 2: Use an entry table with a trigger
-- Create the loading table
CREATE TABLE LoadingZone (
  id INT PRIMARY KEY,
  campaign_name VARCHAR(50),
  ad_group_name VARCHAR(50),
  ad_name VARCHAR(50)
);

-- Create the trigger function
CREATE OR REPLACE FUNCTION insert_into_tables() RETURNS TRIGGER AS $$
BEGIN
  -- Insert into Campaign table
  INSERT INTO Campaign (id, campaign_name)
  SELECT NEW.id, NEW.campaign_name
  ON CONFLICT DO NOTHING;

  -- Insert into AdGroup table
  INSERT INTO AdGroup (id, ad_group_name, campaign_id)
  SELECT NEW.id, NEW.ad_group_name, c.id
  FROM Campaign c
  WHERE c.campaign_name = NEW.campaign_name
  ON CONFLICT DO NOTHING;

  -- Insert into AdGroupAd table
  INSERT INTO AdGroupAd (id, ad_name, ad_group_id)
  SELECT NEW.id, NEW.ad_name, ag.id
  FROM AdGroup ag
  WHERE ag.ad_group_name = NEW.ad_group_name
  ON CONFLICT DO NOTHING;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Create the trigger
CREATE TRIGGER insert_trigger
AFTER INSERT ON LoadingZone
FOR EACH ROW
EXECUTE FUNCTION insert_into_tables();
-- Insert data into the LoadingZone table
INSERT INTO LoadingZone (id, campaign_name, ad_group_name, ad_name)
VALUES (1, 'International Stuff', 'Regional 1', 'Blue');

These are my thoughts in favor of the trigger:

  1. Self-documentation: The trigger-based approach self-documents the correct insertion order, making it easier to understand the relationships between the tables.
  2. Fail-safe: If an insertion fails due to an error thrown in one of the child tables, the database should roll back the entire insert. Otherwise, this scenario could be more difficult to recover from.
  3. Stronger data integrity: The trigger enforces the correct order of insertion, reducing the risk of data inconsistencies and ensuring stronger data integrity.

Thanks for any guidance on this topic.


Solution

  • Triggers are useful, but they are also invisible, complicated, and can get in the way of doing things not envisioned by the trigger author. Adding a table just for the trigger adds even more complexity.

    Instead, rely on constraints, such as foreign key and not null, to ensure referential integrity; make it impossible to insert invalid rows.

    • Your foreign keys will prevent referencing an item which does not exist. This will catch if you try to insert in the wrong order.

    • Declare all the foreign keys not null. This prevents inserting a row with no parent.

    • Use auto-increment for primary keys to ensure you always get a unique id. This also makes it obvious you're inserting in the wrong order; you need to insert in the right order to get the foreign keys. It also avoids race conditions with other connections inserting at the same time.

    • Use last_insert_id (MySQL-specific) to get the ID of the previous auto-incremented insert.

    • Use transactions to rollback all the inserts if any fail.

    Other notes on your schema.

    • Use bigints for primary keys. The ID is incremented every time an insert is attempted. 2 billion attempted inserts comes up fast.

    • Don't arbitrarily limit the size of text fields. varchar is VARiable character size; you're not saving any space and just limiting your users.

    CREATE TABLE Campaign (
      id bigint PRIMARY KEY AUTO_INCREMENT,
      campaign_name VARCHAR(255) not null
    );
    
    CREATE TABLE AdGroup (
      id bigint PRIMARY KEY AUTO_INCREMENT,
      ad_group_name VARCHAR(255) not null,
      campaign_id bigint not null,
      FOREIGN KEY (campaign_id) REFERENCES Campaign(id)
    );
    
    CREATE TABLE AdGroupAd (
      id bigint PRIMARY KEY AUTO_INCREMENT,
      ad_name VARCHAR(255) not null,
      ad_group_id bigint not null,
      FOREIGN KEY (ad_group_id) REFERENCES AdGroup(id)
    );
    
    begin;
    
    INSERT INTO Campaign (campaign_name) VALUES ('International Stuff');
    
    -- Use `select last_insert_id()` to get the ID of the Campaign.
    INSERT INTO AdGroup (ad_group_name, campaign_id) VALUES ('Regional 1', (select last_insert_id()));
    
    -- Use `select last_insert_id()` to get the ID of the AdGroupAd.
    INSERT INTO AdGroupAd (ad_name, ad_group_id) VALUES ('Blue', (select last_insert_id()));
    
    commit;
    
    BEGIN;
    
    INSERT INTO Campaign (campaign_name) VALUES ('Buy our stuff');
    
    INSERT INTO AdGroup (ad_group_name, campaign_id) VALUES ('Stuff sellers', (select last_insert_id()));
    
    -- Bad insert, this will raise an error.
    INSERT INTO AdGroupAd (ad_name, ad_group_id) VALUES ('Stuff', 1234567);
    
    -- Roll back all the inserts.
    ROLLBACK;
    

    Demonstration.

    Note: MySQL's transaction handling is inconsistent. "InnoDB sometimes rolls back only the statement that failed, other times it rolls back the entire transaction." 🤦 That's why the explicit rollback is necessary. Other databases will rollback the whole transaction on error automatically. MySQL is quite non-standard and will teach you bad habits, I advocate you switch to PostgreSQL, especially while learning.

    You can write a small program to do the inserts, or as SalmanA suggested, a stored procedure, but it shouldn't be necessary.