Search code examples
sqlregexamazon-athenaprestotext-extraction

Extract Strings Using AWS Athena or PrestoDB Regex Function


I have a table named

logs

and column named

url

whose value is like this

https://api.abc.com:443/xyz/live/dashboard_critical/v1?cust_id=-1111%7C-1111%7C-1111%7C-1111%7C-1111%7C-1111%7C-1111%7C-1111%7C-1111%7C-1111%7C-1111&model_id=&startdate=2021-06-29&enddate=2021-07-28&wcombo=&site_id=&instance=&unit=&combo=&source=&priority=&verification=&critical=Yes%7Ctrue&percentile=95&startevent=receiveddatetime&endevent=uploadtohostdatetime

I trying to extract all strings after '/' OR '?' OR '%7C' OR '&'

I am able to extract it individually but not all 4 together

SELECT regexp_extract_all(url, '\d+[/]*')
FROM logs.url
WHERE request_verb='GET'
        AND REGEXP_LIKE(url, 'https://api.abc.com:443/xyz/live/');

I am looking for output like below

[https:,api.abc.com:443,xyz,live,dashboard_critical,v1,cust_id=-1111,-1111,-1111,-1111,-1111,-1111,-1111,-1111,-1111,-1111,-1111,model_id=,startdate=2021-06-29,enddate=2021-07-28,wcombo=,site_id=,instance=,unit=,combo=,source=,priority=,verification=,critical=Yes,true,percentile=95,startevent=receiveddatetime,endevent=uploadtohostdatetime]

Solution

  • You can use regexp_split(str, regexp) function, as a regexp pattern concatenate all values by wich string should be splitted using | (OR in regexp), it will produce array required. Note: some characters have special meaning in Presto CLI or regexp and need shielding.

    select regexp_split(url,'/+|\?|%%7C|&')
    

    Regexp meaning:

    /+ - one or more slash

    | - OR

    \? - literally ?, needs shielding by backslash because ? has special meaning in regexp

    | - OR

    %%7C - literally %7C. % has special meaning in Presto CLI and presto is using double-character notation for shielding.

    | - OR

    & - literally &

    After you got an ARRAY, you can concatenate it or explode, etc