I am running a docker image of Vertica
on windows
. I have created a table in vertica
with this schema (student_id
is primary key
)
dbadmin@d1f942c8c1e0(*)=> \d testschema.student;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
------------+---------+------------+-------------+------+---------+----------+-------------+-------------
testschema | student | student_id | int | 8 | | t | t |
testschema | student | name | varchar(20) | 20 | | f | f |
testschema | student | major | varchar(20) | 20 | | f | f |
(3 rows)
student_id
is a primary key. I am testing loading data from csv
file using copy
command.
First I used insert
- insert into testschema.student values (1,'Jack','Biology');
Then I created a csv
file at /home/dbadmin/vertica_test
directory -
vi student.csv
2,Kate,Sociology
3,Claire,English
4,Jack,Biology
5,Mike,Comp. Sci
Then I ran this command
copy testschema.students from '/home/dbadmin/vertica_test/student.csv' delimiter ',' rejected data as table students_rejected;
I tested the result
select * from testschema.student - shows 5 rows
select * from students_rejected; - no rows
Then I creates another csv
file with bad data at /home/dbadmin/vertica_test directory
vi student_bad.csv
bad_data_type_for_student_id,UnaddedStudent, UnaddedSubject
6,Cassey,Physical Education
I added data from bad csv
file
copy testschema.students from '/home/dbadmin/vertica_test/student.csv' delimiter ',' rejected data as table students_rejected;
Then I tested the output
select * from testschema.student - shows 6 rows <-- only one row got added. all ok
select * from students_rejected; - shows 1 row <-- bad row's entry is here. all ok
all looks good
Then I added the bad data again without the rejected data option
copy testschema.students from '/home/dbadmin/vertica_test/student_bad.csv' delimiter ',' ;
But now the entry with student id 6 got added again!!
student_id | name | major
------------+--------+--------------------
1 | Jack | Biology
2 | Kate | Sociology
3 | Claire | English
4 | Jack | Biology
5 | Mike | Comp. Sci
6 | Cassey | Physical Education <--
6 | Cassey | Physical Education <--
Shouldn't this have got rejected?
If you created your students
with a command of this type:
DROP TABLE IF EXISTS students;
CREATE TABLE students (
student_id int
, name varchar(20)
, major varchar(20)
, CONSTRAINT pk_students PRIMARY KEY(student_id)
);
that is, without the explicit keyword ENABLED
, then the primary key constraint is disabled. That is, you can happily insert duplicates, but will run into an error if you later want to join to the students
table via the primary key column.
With the primary key constraint enabled ...
[...]
, CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED
[...]
I think you get the desired effect.
The whole scenario:
DROP TABLE IF EXISTS students;
CREATE TABLE students (
student_id int
, name varchar(20)
, major varchar(20)
, CONSTRAINT pk_students PRIMARY KEY(student_id) ENABLED
);
INSERT INTO students
SELECT 1,'Jack' ,'Biology'
UNION ALL SELECT 2,'Kate' ,'Sociology'
UNION ALL SELECT 3,'Claire','English'
UNION ALL SELECT 4,'Jack' ,'Biology'
UNION ALL SELECT 5,'Mike' ,'Comp. Sci'
UNION ALL SELECT 6,'Cassey','Physical Education'
;
-- out OUTPUT
-- out --------
-- out 6
COMMIT;
COPY students FROM STDIN DELIMITER ','
REJECTED DATA AS TABLE students_rejected;
6,Cassey,Physical Education
\.
-- out vsql:/home/gessnerm/._vfv.sql:4: ERROR 6745:
-- out Duplicate key values: 'student_id=6'
-- out -- violates constraint 'dbadmin.students.pk_students'
SELECT * FROM students;
-- out student_id | name | major
-- out ------------+--------+--------------------
-- out 1 | Jack | Biology
-- out 2 | Kate | Sociology
-- out 3 | Claire | English
-- out 4 | Jack | Biology
-- out 5 | Mike | Comp. Sci
-- out 6 | Cassey | Physical Education
SELECT * FROM students_rejected;
-- out node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason
-- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
-- out (0 rows)
And the only reliable check seems to be the ANALYZE_CONSTRAINTS()
call ...
ALTER TABLE students ALTER CONSTRAINT pk_students DISABLED;
-- out Time: First fetch (0 rows): 7.618 ms. All rows formatted: 7.632 ms
COPY students FROM STDIN DELIMITER ','
REJECTED DATA AS TABLE students_rejected;
6,Cassey,Physical Education
\.
-- out Time: First fetch (0 rows): 31.790 ms. All rows formatted: 31.791 ms
SELECT * FROM students;
-- out student_id | name | major
-- out ------------+--------+--------------------
-- out 1 | Jack | Biology
-- out 2 | Kate | Sociology
-- out 3 | Claire | English
-- out 4 | Jack | Biology
-- out 5 | Mike | Comp. Sci
-- out 6 | Cassey | Physical Education
-- out 6 | Cassey | Physical Education
SELECT * FROM students_rejected;
-- out node_name | file_name | session_id | transaction_id | statement_id | batch_number | row_number | rejected_data | rejected_data_orig_length | rejected_reason
-- out -----------+-----------+------------+----------------+--------------+--------------+------------+---------------+---------------------------+-----------------
-- out (0 rows)
SELECT ANALYZE_CONSTRAINTS('students');
-- out Schema Name | Table Name | Column Names | Constraint Name | Constraint Type | Column Values
-- out -------------+------------+--------------+-----------------+-----------------+---------------
-- out dbadmin | students | student_id | pk_students | PRIMARY | ('6')
-- out (1 row)