.net, collections, deduplication

Neatest way to get a distinct list of phone numbers (without removing original formatting)?


We have a master Person record and one (or more) duplicate Persons and we are merging their data, prioritising the master over the duplicate(s).

When it comes to phone numbers the goal is to merge their data, with a single phone number going into the Phone field and any other phone numbers going into a notes field (so as not to discard them completely). Records may or may not contain a phone number.

For neatness, we don't want to fill the notes field with a bunch of numbers which are basically the same. So we don't want the field to contain both of these:

(1234) 123123
1234 123123

This would be easy if we could just discard the formatting and spaces, but we need to retain those (except for whitespace at the beginning/end).

We started by creating a Structure (not sure why we have a Structure versus a Class, but anyway):

    Friend Structure PhoneNumber

        Private _Raw As String
        Public Property Raw() As String
            Get
                Return _Raw
            End Get
            Set(ByVal value As String)
                _Raw = value
            End Set
        End Property

        Private _Stripped As String
        Public Property Stripped() As String
            Get
                Return _Stripped
            End Get
            Set(ByVal value As String)
                _Stripped = value
            End Set
        End Property

        Sub New(ByVal num As String)
            Raw = num
            Dim RegexObj As New System.Text.RegularExpressions.Regex("[^\d]")
            Stripped = RegexObj.Replace(num, "")
            MsgBox(num & vbCrLf & Stripped)
        End Sub
    End Structure

Then, the merge code looks like this:

    Dim phones As New List(Of PhoneNumber)
    If master.Phone.Trim.Length > 1 Then
        phones.Add(New PhoneNumber(master.Phone.Trim))
    End If
    For Each x As Person In duplicates
        If x.Phone.Trim.Length > 1 And Not phones.Contains(New PhoneNumber(x.Phone.Trim)) Then
            phones.Add(New PhoneNumber(x.Phone.Trim))
        End If
    Next
    If phones.Count > 0 Then
        master.Phone = phones(0).Raw
    End If
    For i = 1 To phones.Count - 1
        master.Notes &= vbCrLf & "Alt. Phone: " & phones(i).Raw
    Next

But, obviously, the problem here is that it still allows the duplicates.

We kind of want the Contains to match on "stripped" values only, but of course it doesn't know to do that.

This already seems like too much code for such a minor feature, but at the moment we're looking at writing something (in the Structure?) that will replace the Contains and match on stripped only. Is there a neater way?
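
One way to make Contains match on the stripped value is to give the Structure its own notion of equality. The following is only a sketch, assuming it is acceptable for two PhoneNumber values to count as equal whenever their digits match. It uses read-only fields instead of the properties above to stay short, and the Demo module is purely illustrative.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

' Hypothetical variant of the PhoneNumber structure from the question.
Friend Structure PhoneNumber
    Implements IEquatable(Of PhoneNumber)

    Public ReadOnly Raw As String
    Public ReadOnly Stripped As String

    Sub New(ByVal num As String)
        Raw = num
        ' Keep digits only, so "(1234) 123123" and "1234 123123" share a value.
        Stripped = Regex.Replace(num, "[^\d]", "")
    End Sub

    ' List(Of T).Contains uses EqualityComparer(Of T).Default, which calls
    ' this method when T implements IEquatable(Of T).
    Public Overloads Function Equals(ByVal other As PhoneNumber) As Boolean _
        Implements IEquatable(Of PhoneNumber).Equals
        Return String.Equals(Stripped, other.Stripped)
    End Function

    Public Overrides Function Equals(ByVal obj As Object) As Boolean
        Return TypeOf obj Is PhoneNumber AndAlso Equals(DirectCast(obj, PhoneNumber))
    End Function

    ' Equal values must produce equal hash codes, so hash the digits only.
    Public Overrides Function GetHashCode() As Integer
        Return If(Stripped Is Nothing, 0, Stripped.GetHashCode())
    End Function
End Structure

Module Demo
    Sub Main()
        Dim phones As New List(Of PhoneNumber)
        phones.Add(New PhoneNumber("(1234) 123123"))
        Console.WriteLine(phones.Contains(New PhoneNumber("1234 123123"))) ' True
        Console.WriteLine(phones.Contains(New PhoneNumber("1234 999999"))) ' False
    End Sub
End Module
```

With equality defined this way, the merge code above should work unchanged, because the first number added (the master's) is the one whose formatting is kept.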

Code is in VB, but C# answers welcome.

Remember too that we have to prioritise the master, so if we use LINQ and Distinct we need to ensure we don't lose the sort order (that's my understanding).
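
For the LINQ route, a sketch that avoids relying on Distinct: GroupBy is documented to yield groups in the order in which each key first appears in the source, so taking the first element of each group keeps the master's formatting when the master's number is listed first. The helper name DistinctPhones and the sample numbers are assumptions for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module DistinctPhonesDemo
    ' Hypothetical helper: returns the first raw form seen for each
    ' distinct digit sequence, in the order the inputs were given.
    Function DistinctPhones(ByVal rawNumbers As IEnumerable(Of String)) As List(Of String)
        Return rawNumbers _
            .Where(Function(p) Not String.IsNullOrEmpty(p)) _
            .Select(Function(p) p.Trim()) _
            .Where(Function(p) p.Length > 1) _
            .GroupBy(Function(p) Regex.Replace(p, "[^\d]", "")) _
            .Select(Function(g) g.First()) _
            .ToList()
    End Function

    Sub Main()
        ' Master first, then the duplicates, so the master's formatting wins.
        Dim candidates As String() = {"(1234) 123123", " 1234 123123 ", "", "5678 555555"}
        For Each p As String In DistinctPhones(candidates)
            Console.WriteLine(p)
        Next
        ' Prints:
        ' (1234) 123123
        ' 5678 555555
    End Sub
End Module
```

The Length > 1 filter mirrors the check in the merge code above. The first element of the result would go into the Phone field and the rest into the notes.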


Solution

  • I figured out that a better way to do this is to use a Dictionary. That way we can do without the Structure: the Key is the stripped phone number, which is what the duplicate lookup runs against, and the Value is the formatted original.

    Something like this:

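        ' Matches every non-digit character so it can be stripped out.
        Dim RegexObj As New System.Text.RegularExpressions.Regex("[^\d]")
        ' Key = digits only (used to detect duplicates), Value = formatted original.
        Dim phones As New Dictionary(Of String, String)
        master.Phone = master.Phone.Trim
        If master.Phone.Length > 1 Then
            phones.Add(RegexObj.Replace(master.Phone, ""), master.Phone)
        End If
        For Each x As Person In duplicates
            x.Phone = x.Phone.Trim
            Dim stripped As String = RegexObj.Replace(x.Phone, "")
            ' AndAlso short-circuits, so the lookup is skipped for too-short numbers.
            If x.Phone.Length > 1 AndAlso Not phones.ContainsKey(stripped) Then
                phones.Add(stripped, x.Phone)
            End If
        Next
        If phones.Count > 0 Then
            ' Relies on the Dictionary enumerating in insertion order, which the
            ' documentation does not guarantee.
            master.Phone = phones.First.Value
            phones.Remove(phones.First.Key)
        End If
        For Each entry As KeyValuePair(Of String, String) In phones
            ' Start a new line only if Notes already has content.
            master.Notes &= If(String.IsNullOrEmpty(master.Notes) OrElse master.Notes.Trim.Length = 0, "", vbCrLf) _
                & "Alt. Phone: " & entry.Value
        Next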
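
One caveat with the Dictionary approach: the documentation for Dictionary(Of TKey, TValue) states that the order in which items are enumerated is undefined, so First is not guaranteed to return the master's number. A possible variant, sketched here under the assumption that the numbers are passed master first, keeps a HashSet for the duplicate check and a List for the order. MergePhones and the sample data are hypothetical names and values, not part of the original code.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module MergePhonesDemo
    ' Hypothetical helper: candidates are passed master first, then duplicates.
    Function MergePhones(ByVal candidates As IEnumerable(Of String)) As List(Of String)
        Dim digitsOnly As New Regex("[^\d]")
        Dim seen As New HashSet(Of String)   ' stripped numbers already kept
        Dim kept As New List(Of String)      ' formatted originals, in input order
        For Each candidate As String In candidates
            If candidate Is Nothing Then Continue For
            Dim trimmed As String = candidate.Trim()
            ' HashSet.Add returns False when the key is already present.
            If trimmed.Length > 1 AndAlso seen.Add(digitsOnly.Replace(trimmed, "")) Then
                kept.Add(trimmed)
            End If
        Next
        Return kept
    End Function

    Sub Main()
        Dim phones As List(Of String) = MergePhones(
            New String() {"(1234) 123123", "1234 123123", "5678 555555"})
        Dim notes As String = ""
        If phones.Count > 0 Then Console.WriteLine("Phone: " & phones(0))
        For i As Integer = 1 To phones.Count - 1
            If notes.Length > 0 Then notes &= vbCrLf
            notes &= "Alt. Phone: " & phones(i)
        Next
        Console.WriteLine(notes)
        ' Prints:
        ' Phone: (1234) 123123
        ' Alt. Phone: 5678 555555
    End Sub
End Module
```

Because the List preserves insertion order, index 0 is always the first usable number in the input, which is the master's whenever the master has one.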