Search code examples
sqlsql-insertpostgresql-9.3

How to find out bad data causing this insert to fail


I have a database (Postgres 9.3.5) of 80 millions records, the insert query below fails with:

ERROR:  invalid input syntax for integer: ""

INSERT INTO DISCOGS.TRACK_DURATION
     SELECT
        track_id,
        duration,
        hours_as_seconds + minutes_as_seconds + seconds as total_seconds
    FROM (
            select
            track_id,
            duration,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 1))::bigint * 60 * 60
                ELSE 0
            END  as hours_as_seconds,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 2))::bigint * 60
                WHEN duration like '%:%'  THEN  (split_part(duration, ':', 1))::bigint * 60
                ELSE 0
            END as minutes_as_seconds,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 3))::bigint
                WHEN duration like '%:%'   THEN (split_part(duration, ':', 2))::bigint
                ELSE 0
            END as seconds
            from discogs.track t1
            where release_id < 10000000
            and t1.duration!='' and t1.duration is not null
            and t1.position!=''
    ) as s1

I can use the where release_id to limit the number of records checked and with lower values its fine, so its bad data , but with so many records how do I find the problem data. Note I'm already filtering out values where duration is empty string and I also found a few records with bad data (such as %%%%) that I have changed but it is still failing.


Solution

  • I would search for malformed durations using a regular expression, as in:

    create table duration (
      d varchar(20)
    );
    
    insert into duration (d) values ('12:34:56');
    insert into duration (d) values ('34:56');
    insert into duration (d) values ('15::'); -- bad one
    insert into duration (d) values (':34:56'); -- bad one
    insert into duration (d) values (':34:'); -- bad one
    insert into duration (d) values ('12:34:'); -- bad one
    insert into duration (d) values ('34:'); -- bad one
    insert into duration (d) values (':56'); -- bad one
    
    select *
      from duration 
      where d not similar to '([0-9]+:)?[0-9]+:[0-9]+'
    

    Result:

    d                     
    ------
    15::                  
    :34:56                
    :34:                  
    12:34:                
    34:                   
    :56 
    

    In your case the query should look like:

    select track_id, duration 
      from discogs.track
      where duration not similar to '([0-9]+:)?[0-9]+:[0-9]+';