Search code examples
sqlamazon-athenaprestoregexp-replacetrino

AWS athena regex replace except first occurrence


I have some string values in column like below

hel_some_data
h_some_data_more_data
hello_some_more_data_data

I need output like below

    hel_some data
    h_some data more data
    hello_some more data data

Basically I want to replace '_' with ' ' except first occurrence.

I am trying to use regex replace but unable to find how to give the position, I have found one syntax on documentation where it looks like providing the position by code from documentation itself not working, I am trying below example for documentation.

SELECT REGEXP_REPLACE('the fox', 'FOX', 'quick brown fox', 1, 'i');

But it says function not found.

Any help, highly appreciated, Thanks,


Solution

  • It seems that you are looking at docs for Redshift, which does have such REGEXP_REPLACE function allowing to specify position (which does not mean what you think it does, if I understood correctly your attempt).

    Athena is not Redshift, it is based on Presto/Trino and does not support such version of REGEXP_REPLACE.

    One option is to use version which splits the string into array of strings based on regex and applies join function (docs):

    -- sample data
    WITH dataset(str) AS (
       values ('hel_some_data'),
            ('h_some_data_more_data'),
            ('hello_some_more_data_data'),
            ('hel_somedata')
    )
    
    -- query
    select regexp_replace(str, '([^_]*_[^_]*)(.*)', x -> x[1] || replace(x[2], '_', ' '))
    from dataset;
    

    Output:

               _col0
    ---------------------------
     hel_some data
     h_some data more data
     hello_some more data data
     hel_some
    

    ([^_]*_[^_]*)(.*) explanation @regex101. Basically the first group matches two strings without _ separated by _ and the second one matches what is left.