I have the following table and I am trying to extract a string that only exists in certain columns and the string is between two different delimiters (, :).
df:
col1 | col2 |
---|---|
Patient 001 data retrieved: 9089800, John,Doe | CA |
Hospital stay | AZ |
Patient 002 data retrieved: 9123010, Steve,Doe | NY |
Patient 003 data retrieved: 9034291, Alex,Doe | MI |
Patient 004 information not found | VT |
df_final
col1 | col2 | result |
---|---|---|
Patient 001 data retrieved: 9089800, John,Doe | CA | 9089800 |
Hospital stay | AZ | |
Patient 002 data retrieved: 9123010, Steve,Doe | NY | 9123010 |
Patient 003 data retrieved: 9034291, Alex,Doe | MI | 9034291 |
Patient 004 information not found | VT |
I understand that the way the data is currently is not efficient but this is the dataset/task I have been given. Is there anyway to work around this?
his is what I have so far but it just retrieves the entire string for all rows. Not sure what I am doing wrong.
SELECT TOP 100 *,
SUBSTRING(col1,CHARINDEX('data retrieved:',col1)+1,
(((LEN(col1))-CHARINDEX(',', REVERSE(col1)))-CHARINDEX('data retrieved:',col1))) AS Result
FROM df
A little bit more bullet proof:
trim(case when charindex(':', col1) <> 0 then
case when charindex(',', col1, charindex(':', col1)+1) <> 0 then
substring(col1, charindex(':', col1)+1,
charindex(',', col1, charindex(':', col1)+1) -
charindex(':', col1) - 1
)
end
end)