Search code examples
sqlsql-servert-sqlssmsdata-cleaning

Cleaning data by adding/removing characters to a string when it meets certain conditions T-SQL


I'm looking for help with cleaning a column in my data set so that I can join to another table.

The first data set is my complete data and includes what we call "reference_numbers" which relate to a specific case. Here is a dummy sample:

reference_number case_opened case_closed
01353568-00000001 11/01/2021 03/02/2022
09736473-00000009 21/04/2005 19/07/2021
05839576-00000012 13/09/2014 19/12/2017
09364857-00000006 13/09/2014 19/12/2017

As you can see, the "reference_number" is 8 digits then hyphen (-) and then another 8 digits. This is how a reference number should look.

My second data set is full of the same "reference_numbers". However, there is inconsistencies in the character length as they are often written differently by individuals:

reference_number Case_workers
1353568-00000001 5
09736473-9 10
5839576-12 7
09364857-000000006 4
  • The first reference_number in the second data set is missing the first "0"
  • The second reference_number in the second data set is missing seven "0" after the hyphen
  • The third reference_number in the second data set is missing both the first "0" and six "0" after the hyphen
  • The fourth reference_number in the second data set has too many digits after the hyphen (there is supposed to be seven 0's)

I want to be able to join the first data set onto the second data set using the reference_number. However, I need to clean them first. Is this possible and is there any efficient way of doing this?

Thanks


Solution

  • If the rules are so specific, you could try to use a combination of STRING_SPLIT and STRING_AGG:

    SELECT 
        t.reference_number,
        STRING_AGG(RIGHT('00000000'+s.value,8),'-') new_reference_number
    FROM dbo.SecondTable t
    CROSS APPLY STRING_SPLIT(t.reference_number,'-') s
    GROUP BY t.reference_number
    ;
    

    Using the sample data you posted, the results are:

    ╔════════════════════╦══════════════════════╗
    ║  reference_number  ║ new_reference_number ║
    ╠════════════════════╬══════════════════════╣
    ║ 09364857-000000006 ║ 09364857-00000006    ║
    ║ 09736473-9         ║ 09736473-00000009    ║
    ║ 1353568-00000001   ║ 01353568-00000001    ║
    ║ 5839576-12         ║ 05839576-00000012    ║
    ╚════════════════════╩══════════════════════╝