Search code examples
vertica

Vertica database added duplicate entry with same primary key


I am running a docker image of Vertica on windows. I have created a table in vertica with this schema (student_id is primary key)

dbadmin@d1f942c8c1e0(*)=> \d testschema.student;
                                        List of Fields by Tables
   Schema   |  Table  |   Column   |    Type     | Size | Default | Not Null | Primary Key | Foreign Key
------------+---------+------------+-------------+------+---------+----------+-------------+-------------
 testschema | student | student_id | int         |    8 |         | t        | t           |
 testschema | student | name       | varchar(20) |   20 |         | f        | f           |
 testschema | student | major      | varchar(20) |   20 |         | f        | f           |
(3 rows)

student_id is a primary key. I am testing loading data from csv file using copy command.

First I used insert - insert into testschema.student values (1,'Jack','Biology');

Then I created a csv file at /home/dbadmin/vertica_test directory -

vi student.csv 
2,Kate,Sociology
3,Claire,English
4,Jack,Biology
5,Mike,Comp. Sci

Then I ran this command

copy testschema.students from '/home/dbadmin/vertica_test/student.csv' delimiter ',' rejected data as table students_rejected;

I tested the result

select * from testschema.student - shows 5 rows
select * from students_rejected; - no rows

Then I creates another csv file with bad data at /home/dbadmin/vertica_test directory

 vi student_bad.csv
bad_data_type_for_student_id,UnaddedStudent, UnaddedSubject
6,Cassey,Physical Education

I added data from bad csv file

copy testschema.students from '/home/dbadmin/vertica_test/student.csv' delimiter ',' rejected data as table students_rejected;

Then I tested the output

select * from testschema.student - shows 6 rows <-- only one row got added. all ok
select * from students_rejected; - shows 1 row  <-- bad row's entry is here. all ok

all looks good

Then I added the bad data again without the rejected data option

copy testschema.students from '/home/dbadmin/vertica_test/student_bad.csv' delimiter ',' ;

But now the entry with student id 6 got added again!!

student_id |  name  |       major
------------+--------+--------------------
          1 | Jack   | Biology
          2 | Kate   | Sociology
          3 | Claire | English
          4 | Jack   | Biology
          5 | Mike   | Comp. Sci
          6 | Cassey | Physical Education <--
          6 | Cassey | Physical Education <--

Shouldn't this have got rejected?


Solution

  • If you created your students with a command of this type:

    DROP TABLE IF EXISTS students;
    CREATE TABLE students (
      student_id int 
    , name       varchar(20)
    , major      varchar(20)
    , CONSTRAINT pk_students PRIMARY KEY(student_id)                                                                                                                                         
    );
    

    that is, without the explicit keyword ENABLED, then the primary key constraint is disabled. That is, you can happily insert duplicates, but will run into an error if you later want to join to the students table via the primary key column.

    With the primary key constraint enabled ...

    [...]
    , CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED                                                                                                                                         
    [...]
    

    I think you get the desired effect.

    The whole scenario:

    DROP TABLE IF EXISTS students;
    CREATE TABLE students (
      student_id int
    , name       varchar(20)
    , major      varchar(20)
    , CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED
    );
    
    INSERT INTO students
              SELECT 1,'Jack'  ,'Biology'
    UNION ALL SELECT 2,'Kate'  ,'Sociology'
    UNION ALL SELECT 3,'Claire','English'
    UNION ALL SELECT 4,'Jack'  ,'Biology'
    UNION ALL SELECT 5,'Mike'  ,'Comp. Sci'
    UNION ALL SELECT 6,'Cassey','Physical Education'
    ;
    -- out  OUTPUT 
    -- out --------
    -- out       6
    
    COMMIT;
    
    COPY students FROM STDIN DELIMITER ','
    REJECTED DATA AS TABLE students_rejected;
    6,Cassey,Physical Education
    \.
    -- out vsql:/home/gessnerm/._vfv.sql:4: ERROR 6745:  
    -- out  Duplicate key values: 'student_id=6' 
    -- out  -- violates constraint 'dbadmin.students.pk_students'
    SELECT * FROM students;
    -- out  student_id |  name  |       major        
    -- out ------------+--------+--------------------
    -- out           1 | Jack   | Biology
    -- out           2 | Kate   | Sociology
    -- out           3 | Claire | English
    -- out           4 | Jack   | Biology
    -- out           5 | Mike   | Comp. Sci
    -- out           6 | Cassey | Physical Education
    SELECT * FROM students_rejected;
    -- out  node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason 
    -- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
    -- out (0 rows)
    

    And the only reliable check seems to be the ANALYZE_CONSTRAINTS() call ...

    ALTER TABLE students ALTER CONSTRAINT pk_students DISABLED;
    -- out Time: First fetch (0 rows): 7.618 ms. All rows formatted: 7.632 ms
    COPY students FROM STDIN DELIMITER ','
    REJECTED DATA AS TABLE students_rejected;
    6,Cassey,Physical Education
    \.
    -- out Time: First fetch (0 rows): 31.790 ms. All rows formatted: 31.791 ms
    
    SELECT * FROM students;
    -- out  student_id |  name  |       major        
    -- out ------------+--------+--------------------
    -- out           1 | Jack   | Biology
    -- out           2 | Kate   | Sociology
    -- out           3 | Claire | English
    -- out           4 | Jack   | Biology
    -- out           5 | Mike   | Comp. Sci
    -- out           6 | Cassey | Physical Education
    -- out           6 | Cassey | Physical Education
    SELECT * FROM students_rejected;
    -- out  node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason 
    -- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
    -- out (0 rows)
    SELECT ANALYZE_CONSTRAINTS('students');
    -- out  Schema Name | Table Name | Column Names | Constraint Name | Constraint Type | Column Values 
    -- out -------------+------------+--------------+-----------------+-----------------+---------------
    -- out  dbadmin     | students   | student_id   | pk_students     | PRIMARY         | ('6')
    -- out (1 row)