Search code examples
sqlpostgresqlsubstring

How to withdraw the substring that follow a pattern in PostgreSQL


I need to find a pattern which will give me a substring which follows logic to substring from the right side, after the third _.

What I want to achieve is to retrieve everything up until 3rd _ .

I tried this SQL, but it did not work

select 
    SUBSTRING('22D_XYZ_xy_100_xyz_123', 1, position('_' in REVERSE('22D_XYZ_xy_100_xyz_123')) - 1) as result;

It returns the 22D whereas I expect 22D_XYZ_xy


Solution

  • The phrase, "substring which follows logic to substring from the right side, after the third _," is difficult to understand. I infer that the intended behavior is to extract the subtring that includes all text before the third _ from the right and not all text before the third _ from the left. When searching from the left, there is also ambiguity as to expected behavior if there are less than three _ characters: should the entire string be returned or should the result be NULL? Unfortunately, the example input, 22D_XYZ_xy_100_xyz_123, and expected output, 22D_XYZ_xy, are consistent with all of these cases. The following query demonstrates all three interpretations:

    SELECT
      string,
      regexp_substr(string, '^.*(?=(_[^_]*){3}$)') as from_right,
      regexp_substr(string, '^([^_]*_){0,2}[^_]*') as from_left,
      regexp_substr(string, '^([^_]*_){2}[^_]*(?=_)') as from_left_requires_3
    FROM (
      VALUES ('22D_XYZ_xy_100_xyz_123'),
             ('1_2_3_4_5_6_7'),
             ('1_2_3_4_5_6'),
             ('1_2_3_4_5_'),
             ('_1_2_3_4_5'),
             ('1_2_3_4_5'),
             ('1_2_3_4'),
             ('1_2_3'),
             ('1_2_'),
             ('1_2'),
             ('1'),
             ('____'),
             ('___'),
             ('__'),
             ('_'),
             (''),
             (NULL)) t(string);
    

    Executing the query gives the following:

    string from_right from_left from_left_requires_3
    '22D_XYZ_xy_100_xyz_123' '22D_XYZ_xy' '22D_XYZ_xy' '22D_XYZ_xy'
    '1_2_3_4_5_6_7' '1_2_3_4' '1_2_3' '1_2_3'
    '1_2_3_4_5_6' '1_2_3' '1_2_3' '1_2_3'
    '1_2_3_4_5_' '1_2_3' '1_2_3' '1_2_3'
    '_1_2_3_4_5' '_1_2' '_1_2' '_1_2'
    '1_2_3_4_5' '1_2' '1_2_3' '1_2_3'
    '1_2_3_4' '1' '1_2_3' '1_2_3'
    '1_2_3' NULL '1_2_3' NULL
    '1_2_' NULL '1_2_' NULL
    '1_2' NULL '1_2' NULL
    '1' NULL '1' NULL
    '____' '_' '__' '__'
    '___' '' '__' '__'
    '__' NULL '__' NULL
    '_' NULL '_' NULL
    '' NULL '' NULL
    NULL NULL NULL NULL

    To search from the right, a positive lookahead zero-length assertion, (?=(_[^_]*){3}$), is used to check that the unmatched portion of the string begins with _ and includes exactly three _ characters. If the assertion is false, then NULL is returned.

    To find the left substring with no more than two _ characters, the regular expression matches up to two substings of zero or more non-_ characters followed by a single _ along with any non-_ characters prior to the next _ or the end of the string.

    Finding the left substring preceding a required third _ is similar to finding the left substring with no more than two _ characters, except that the occurrence specifier for the group is {2} instead of {0,2} and a positive lookahead zero-lengh assertion, (?=_), ensures that _ is the first unmatched character. As with the search from the right, if the assertion is false, then NULL is returned.