I have a json array that I need to pull a string from. We're storing this data in Bigquery and for the most part it's been pretty easy, except for extracting the id from the user_id field from the array. There are seven possible strings on how the user_id
could appear. For instance, in the json array, user_id
could be something like this: "user_id": "client-id-12345678-abcd-0926-zyxw-de07e3f9ce85"
or it could be "user_id": "operator-id-12345678-abcd-0926-zyxw-de07e3f9ce85"
or "user_id": "auto_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
Here is what the sample json looks like:
{
"members": [
{
"channel_count": 0,
"profile_url": "something.png",
"push_enabled": true,
"push_trigger_option": "default",
"state": "joined",
"user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
}
]
}
I thought the best way to handle this originally was to run a sql replace like so:
replace(replace('client-us-operator-us-iddddddd','client-us-', '' ), 'operator-us-', '')
But because there are a couple different user_id prefixes it's making it difficult. I am not sure if those are all the possible edge cases. If it's just client and operator it's 12345678-abcd-0926-zyxw-de07e3f9ce85
and that works as expected. When it's "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
it pulls in the whole string, and sometime the user_id
appears as "agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
I just need everything to the right of the prefix. So in this case it would be 12345678-abcd-0926-zyxw-de07e3f9ce85
. However depending on if it's an automated message, or a user responding the prefix length changes, as well as the hyphen or underscore.
I'm not the strongest when using regex, so I'm on here to see if there a better way of handling the extraction of this id from the whole user_id string.
REGEXP
is what you will want to use since it allows for a defined set of pattern matching both for including and excluding. Answer at bottom.
Breakdown of what the functions are doing:
JSON_EXTRACT_SCALAR(json_string, json_path)
$.fieldname
to point to the root of the JSON object (the $), and the specific field to locate (.fieldname).JSON_EXTRACT_SCALAR(your_json_object, '$.user_id')
-From my json, find the root key value pair for user_id.
REGEXP_EXTRACT(string, regex_pattern)
:
for a string, match a pattern.
JSON_EXTRACT_SCALAR()
r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$'
^
points to start of string(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)
non capturing group. (?...)
syntax groups prefixes without capturing them in the output. This is where you would put your different prefix possibilities.([a-zA-Z0-9-]+)
This is the capturing group that you want in the output. from left to right it says: capture anything that is from a-z (lowercase) A-Z (uppercase), 0-9 (digits), - (hyphens). If you expect ids to have other captured formating, you would include it here. Such as a period or slash.$
says the end of the string.WITH sample_data AS (
SELECT '''
{
"members": [
{
"channel_count": 0,
"profile_url": "something.png",
"push_enabled": true,
"push_trigger_option": "default",
"state": "joined",
"user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
},
{
"channel_count": 1,
"profile_url": "another.png",
"push_enabled": false,
"push_trigger_option": "custom",
"state": "joined",
"user_id": "operator-id-87654321-zyxw-0926-abcd-ce85de07e3f9"
}
]
}
''' AS json_data
)
SELECT
REGEXP_EXTRACT(JSON_EXTRACT_SCALAR(m, '$.user_id'), r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$') AS extracted_user_id
FROM
sample_data,
UNNEST(JSON_EXTRACT_ARRAY(json_data, "$.members")) AS m