How to retrieve a fixed length string from json array

I have a json array that I need to pull a string from. We're storing this data in Bigquery and for the most part it's been pretty easy, except for extracting the id from the user_id field from the array. There are seven possible strings on how the user_id could appear. For instance, in the json array, user_id could be something like this: "user_id": "client-id-12345678-abcd-0926-zyxw-de07e3f9ce85" or it could be "user_id": "operator-id-12345678-abcd-0926-zyxw-de07e3f9ce85" or "user_id": "auto_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"

Here is what the sample json looks like:

{
    "members": [
        {
            "channel_count": 0,
            "profile_url": "something.png",
            "push_enabled": true,
            "push_trigger_option": "default",
            "state": "joined",
            "user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
        }
    ]
}

I thought the best way to handle this originally was to run a sql replace like so: replace(replace('client-us-operator-us-iddddddd','client-us-', '' ), 'operator-us-', '') But because there are a couple different user_id prefixes it's making it difficult. I am not sure if those are all the possible edge cases. If it's just client and operator it's 12345678-abcd-0926-zyxw-de07e3f9ce85 and that works as expected. When it's "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85" it pulls in the whole string, and sometime the user_id appears as "agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"

I just need everything to the right of the prefix. So in this case it would be 12345678-abcd-0926-zyxw-de07e3f9ce85. However depending on if it's an automated message, or a user responding the prefix length changes, as well as the hyphen or underscore.

I'm not the strongest when using regex, so I'm on here to see if there a better way of handling the extraction of this id from the whole user_id string.

Solution

REGEXP is what you will want to use since it allows for a defined set of pattern matching both for including and excluding. Answer at bottom.

Breakdown of what the functions are doing:

JSON_EXTRACT_SCALAR(json_string, json_path)

json_string: the json string where you are trying to extract a value.
json_path: path to the value within the json, using $.fieldname to point to the root of the JSON object (the $), and the specific field to locate (.fieldname).

JSON_EXTRACT_SCALAR(your_json_object, '$.user_id') -From my json, find the root key value pair for user_id.

This also applies to the UNNEST of all the JSONs for members.

REGEXP_EXTRACT(string, regex_pattern): for a string, match a pattern.

string: the extracted portion of user_id from JSON_EXTRACT_SCALAR()
regex_pattern: r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$'
- ^ points to start of string
- (?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_) non capturing group. (?...) syntax groups prefixes without capturing them in the output. This is where you would put your different prefix possibilities.
- ([a-zA-Z0-9-]+) This is the capturing group that you want in the output. from left to right it says: capture anything that is from a-z (lowercase) A-Z (uppercase), 0-9 (digits), - (hyphens). If you expect ids to have other captured formating, you would include it here. Such as a period or slash.
- $ says the end of the string.

WITH sample_data AS (
  SELECT '''
  {
    "members": [
        {
            "channel_count": 0,
            "profile_url": "something.png",
            "push_enabled": true,
            "push_trigger_option": "default",
            "state": "joined",
            "user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
        },
        {
            "channel_count": 1,
            "profile_url": "another.png",
            "push_enabled": false,
            "push_trigger_option": "custom",
            "state": "joined",
            "user_id": "operator-id-87654321-zyxw-0926-abcd-ce85de07e3f9"
        }
    ]
  }
  ''' AS json_data
)

SELECT 
  REGEXP_EXTRACT(JSON_EXTRACT_SCALAR(m, '$.user_id'), r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$') AS extracted_user_id
FROM 
  sample_data,
  UNNEST(JSON_EXTRACT_ARRAY(json_data, "$.members")) AS m