Search code examples
regexgoogle-bigqueryregexp-replace

BigQuery REGEX get text between 2 characters


From this string :

gs://analytics_models_from_g_rdd_ga/9g653798d-7e13-4238-55443-6db72ace9911$$$$events_today.csv

I'm trying to extract this text :

9g653798d-7e13-4238-55443-6db72ace9911

and in separate REGEX function events_today

The rolls for the first string:

  1. The string will always end with $$$$.
  2. The string will always start with ga/.
  3. The extracted text length can be changed.

I sew several tutorials and red regex references but still not succeed I thought the solution will be /\\s*(.*?)\\s*$$$$ but is not.

In general, to extract a string between 2 strings


Solution

  • In general, to match any string between two specific chars, you can use

    REGEXP_EXTRACT(col, r"<CHAR>(.*?)<CHAR>")
    

    To match between / and $, you can use

    REGEXP_EXTRACT(col, r"/(.*?)\$")
    

    Also, see the list of chars that must be escaped in a regex at What special characters must be escaped in regular expressions?

    Here, you can use

    REGEXP_REPLACE(col, r".*ga/([^/$]*)\${4}.*", r"\1")
    

    See the regex demo.

    The identical, but a bit simplified since .* will become redundant, pattern can be used in REGEGEXP_EXTRACT:

    REGEXP_EXTRACT(col, r"ga/([^/$]*)\${4}")
    

    Pattern details:

    • .* - any zero or more chars other than line break chars, as many as possible
    • ga/ - ga and a slash
    • ([^/$]*) - Group 1: any zero or more chars other than / and $
    • \${4} - four $ symbols
    • .* - any zero or more chars other than line break chars, as many as possible.

    The replacement is the Group 1 backreference, \1, that replaces the whole match with the contents of the group.