I'm looking for help with cleaning a column in my data set so that I can join to another table.
The first data set is my complete data and includes what we call "reference_numbers" which relate to a specific case. Here is a dummy sample:
reference_number | case_opened | case_closed |
---|---|---|
01353568-00000001 | 11/01/2021 | 03/02/2022 |
09736473-00000009 | 21/04/2005 | 19/07/2021 |
05839576-00000012 | 13/09/2014 | 19/12/2017 |
09364857-00000006 | 13/09/2014 | 19/12/2017 |
As you can see, the "reference_number" is 8 digits then hyphen (-) and then another 8 digits. This is how a reference number should look.
My second data set is full of the same "reference_numbers". However, there is inconsistencies in the character length as they are often written differently by individuals:
reference_number | Case_workers |
---|---|
1353568-00000001 | 5 |
09736473-9 | 10 |
5839576-12 | 7 |
09364857-000000006 | 4 |
I want to be able to join the first data set onto the second data set using the reference_number. However, I need to clean them first. Is this possible and is there any efficient way of doing this?
Thanks
If the rules are so specific, you could try to use a combination of STRING_SPLIT
and STRING_AGG
:
SELECT
t.reference_number,
STRING_AGG(RIGHT('00000000'+s.value,8),'-') new_reference_number
FROM dbo.SecondTable t
CROSS APPLY STRING_SPLIT(t.reference_number,'-') s
GROUP BY t.reference_number
;
Using the sample data you posted, the results are:
╔════════════════════╦══════════════════════╗
║ reference_number ║ new_reference_number ║
╠════════════════════╬══════════════════════╣
║ 09364857-000000006 ║ 09364857-00000006 ║
║ 09736473-9 ║ 09736473-00000009 ║
║ 1353568-00000001 ║ 01353568-00000001 ║
║ 5839576-12 ║ 05839576-00000012 ║
╚════════════════════╩══════════════════════╝