Search code examples
sqlregexpostgresqlparsingstring-parsing

How to parse names and percents


Import table

create table import 
(
    articleNumber c(20) primary key, 
    material char(70) 
);

insert into import 
values ('TEST', 'Main fabric: 50% Organic Cotton, 49% Cotton, 1% Elastane' )

The material column contains percents and names, in this format:

<percent>%<space><name>,

The Product table contains those values in columns:

create table toode 
(
    originart char(20) primary key,
    mprots1 Numeric(9,3),
    mprots2 Numeric(9,3),
    mprots3 Numeric(9,3),
    materjal1 char(6),
    materjal2 char(6),
    materjal3 char(6),
    materjal4 char(6),
    materjal5 char(6) 
);

insert into toode (originart)  
values ('TEST')

How to update product table row with values from material string?

Updated row should contain

mprots1 = 50
mprots2 = 49
mprots3 = 1
materjal1 = Organi
materjal2 = Cotton
materria3 = Elasta

I'm using Postgres 16


Solution

  • Combining the other two threads, regexp_substr() can handle that:
    demo at db<>fiddle

    insert into toode
    select articleNumber
          ,regexp_substr(material,'(\d+)%',1,1,'',1)::numeric(9,3)
          ,regexp_substr(material,'(\d+)%',1,2,'',1)::numeric(9,3)
          ,regexp_substr(material,'(\d+)%',1,3,'',1)::numeric(9,3)
          ,regexp_substr(material,'% ([^,]+),?',1,1,'',1)
          ,regexp_substr(material,'% ([^,]+),?',1,2,'',1)
          ,regexp_substr(material,'% ([^,]+),?',1,3,'',1)
    from import_
    returning *;
    
    originart mprots1 mprots2 mprots3 materjal1 materjal2 materjal3 materjal4 materjal5
    TEST 50.000 49.000 1.000 Organic Cotton Cotton Elastane null null

    For percentages, that's at least one digit, followed by a %.

    For material names:

    • You're looking for % followed by the material name
    • The material name is expected to be anything after that, up until the end or nearest comma. The [^,]+ means one or more character other than the comma, so it tolerates spaces, hyphenated names etc.
    • ,? mean at most one comma, so that you don't need a trailing comma at the end to get a match

    The parameters after the pattern are the same as in the post in the other thread.

    • 1 tells it to start at the beginning
    • 1, 2 or 3 means you want the 1st, 2nd or 3rd match
    • '' is where you'd add flags, like 'i' for case insensitive matching
    • 1 means the first parenthesised sub-expression within the pattern, so it'll only catch the material name, without the leading % and the optional trailing comma.