Search code examples
sqlregexgoogle-bigqueryregexp-replace

How to split a string using regex?


I'm trying to split the ad_content with each "_" character, but I don't know why I can't go further than the 9th split word (splits[SAFE_OFFSET(8)] AS objective).

This is the query I'm using:

SELECT
    ad_content,
    splits[SAFE_OFFSET(0)] AS country,
    splits[SAFE_OFFSET(1)] AS product,
    splits[SAFE_OFFSET(2)] AS budget,
    splits[SAFE_OFFSET(3)] AS source,
    splits[SAFE_OFFSET(4)] AS campaign,
    splits[SAFE_OFFSET(5)] AS audience,
    splits[SAFE_OFFSET(6)] AS route_type,
    splits[SAFE_OFFSET(7)] AS business,
    splits[SAFE_OFFSET(8)] AS objective,
    splits[SAFE_OFFSET(9)] AS format,
    splits[SAFE_OFFSET(10)] AS nnn,
    splits[SAFE_OFFSET(11)] AS date,
FROM (
  SELECT
    AD_CONTENT,
    SPLIT(REGEXP_REPLACE(
            AD_CONTENT,
            r'([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_(.+)',
            r'\1|\2|\3|\4|\5|\6|\7|\8|\9|\10|\11|\12'),
          '|') AS splits
  FROM ga_digital_marketing

For example, ad_content = us_latam_perf_facebook_black-friday_bbdd-push_SCL-CCP_domestic_conversion_push_all_20210906

And this is the results using the query from above:

ad_content country product budget source campaign audience route_type business objective format nnn date
us_latam_perf_facebook_black-friday_bbdd-push_SCL-CCP_domestic_conversion_push_all_20210906 us latam perf facebook black-friday bbdd-push SCL-CCP domestic conversion us0 us1 us2

As you can see above, from the format column (splits[SAFE_OFFSET(9)] AS format) are not giving the results correctly.

I believe that the problem is in here: r'\1|\2|\3|\4|\5|\6|\7|\8|\9|\10|\11|\12') because maybe the number 0 of |\10 is not recognizing it as a number but as a string. And that's why I have as a result us0 us1 and us2

Is there a solution for this limitation?

Is there another way to split the ad_content example?


Solution

  • BigQuery's REGEXP_REPLACE supports only \1 to \9 - that is why!

    Is there a solution for this limitation?

    Use below approach instead

    SELECT
        -- ad_content,
        splits[SAFE_OFFSET(0)] AS country,
        splits[SAFE_OFFSET(1)] AS product,
        splits[SAFE_OFFSET(2)] AS budget,
        splits[SAFE_OFFSET(3)] AS source,
        splits[SAFE_OFFSET(4)] AS campaign,
        splits[SAFE_OFFSET(5)] AS audience,
        splits[SAFE_OFFSET(6)] AS route_type,
        splits[SAFE_OFFSET(7)] AS business,
        splits[SAFE_OFFSET(8)] AS objective,
        splits[SAFE_OFFSET(9)] AS format,
        splits[SAFE_OFFSET(10)] AS nnn,
        splits[SAFE_OFFSET(11)] AS date,
    FROM (
      SELECT
        AD_CONTENT,
        SPLIT(AD_CONTENT, '_') AS splits
      FROM ga_digital_marketing
    )    
    

    if applied to sample in your question - output is

    enter image description here