Search code examples
sqlpostgresqlpostgresql-9.3

Transferring data between tables violation of primary key constraint


I am working on a database project with postgres. I have a big table that contains data imported from a csv file that I need to transfer to other smaller tables that represent the database I designed.

The big table with imported data is called data_minerva, the table I want to transfer part of the data to is called related_articles. Here is part of the ddl code:

CREATE SEQUENCE article_id_seq; 
CREATE TABLE article (
    article_id integer UNIQUE NOT NULL DEFAULT nextval('article_id_seq'),
    title varchar,
    body varchar,
    publish_time timestamp,
    creation_time timestamp,
    id integer,
    PRIMARY KEY (article_id),
    FOREIGN KEY (id) REFERENCES team (id)
);
ALTER SEQUENCE article_id_seq OWNED BY article.article_id;

CREATE TABLE related_articles (
    article_id1 integer NOT NULL,
    article_id2 integer NOT NULL,
    kind varchar,
    PRIMARY KEY (article_id1, article_id2, kind),
    FOREIGN KEY (article_id1) REFERENCES article (article_id),
    FOREIGN KEY (article_id2) REFERENCES article (article_id)
);

As you can see in the above code snippet an article is defined by it's ID. The data_minerva table does not contain an ID column. Now when I want to transfer data from data_minerva to related_articles I come into trouble that there are duplicates in the data_minerva table and they violate the primary key constrain of the table related_articles. However I tried creating a rule to ignore these duplicates but with no succes. I think I need to do something more with the SELECT DISTINCT but I can't to figure it out. The query I use to transfer data:

CREATE RULE "ignore" AS ON INSERT TO related_articles
    WHERE EXISTS (SELECT 1 FROM related_articles WHERE article_id1=NEW.article_id1 AND article_id2=NEW.article_id2 AND kind=NEW.kind)
DO INSTEAD NOTHING;


INSERT INTO related_articles (article_id1, article_id2, kind)
SELECT DISTINCT ON (data_minerva.articletitle, data_minerva.articlestarttime, data_minerva.writeremail,article.id, article.id, data_minerva.linkedarticletitle, data_minerva.linkedarticlestarttime)
(SELECT article_id FROM article WHERE data_minerva.linkedarticletitle IS NOT NULL AND article.title=data_minerva.articletitle AND article.creation_time=data_minerva.articlestarttime::timestamp),
(SELECT article_id FROM article WHERE article.title=data_minerva.linkedarticletitle AND article.creation_time=data_minerva.linkedarticlestarttime::timestamp),
linkedtype FROM data_minerva, article WHERE data_minerva.linkedarticletitle IS NOT NULL;

Solution

  • You're probably better served by thinking along these lines.

    select a1.article_id, a2.article_id, d.linkedtype
    from article a1
    inner join data_minerva d on a1.title = d.articletitle and a1.creation_time = d.articlestarttime
    inner join article a2 on a2.title = d.linkedarticletitle and a2.creation_time = d.linkedarticlestarttime
    

    It's possible this needs to be select distinct instead of just select. You might also need to check for null article titles, or you might need to handle null article titles separately.

    Once you're satisfied that this query returns the right rows, just put insert into related_articles on top, and run it.