Search code examples
postgresqlsubstring

Extract substrings based on multiple conditions in postgresql


I have following column in a postgres table.

col1                                start    end
p.[A138T;S160G;D221_E222delinsGK]   138      138
p.[A138T;S160G;D221_E222delinsGK]   160      160
p.[A138T;S160G;D221_E222delinsGK]   221      222

I am looking for a way to split the ; separated values in multiple rows. The expected output is:

col1                                start    end
p.A138T                             138      138
p.S160G                             160      160
p.D221_E222delinsGK                 221      222

I am using following query but it does not works for the third row.

select
       case 
            when start = "end" and col1 like 'p.[%' then 'p.'||(regexp_match(col1, '([A-Z]'||start||'[A-Z])'))[1] 
            when start != "end" and col1 like 'p.[%' then 'p.'||(regexp_match(col1, '[A-Z\d+_delins]+'||start||'[A-Z\d+_delins]+'))[1] 
            else col1,
                   
            start,
            end
from table

Any help is highly appreciated!


Solution

  • I prefer substring() to extract strings based on a regex, so I don't have to deal with the array that regexp_match() returns.

    select case 
              when col1 like 'p.%' and start = "end" 
                then 'p.'||substring(col1 from concat('[A-Z]', start, '[A-Z]'))
              when col1 like 'p.%' and start <> "end" 
                then 'p.'||substring(col1 from concat('[A-Z]', start, '_[A-Z]', "end", 'delins[^\]]+'))
              else col1
           end as col1, 
           start,
           "end"
    from the_table;
    

    As far as I can tell, the regex for the second alternative adds the _delins incorrectly to the regex marking the "start" value. And if that is a constant value, it should not be part of a regex "range" because [delins]' would also match the string sindel. Not sure if you intended that.

    Another option would be to simply take everything after the "end" marker and exclude the closing ] by using

    substring(col1 from concat('[A-Z]', start, '_[A-Z]', "end", '[^\]]+'))