Search code examples
pattern-matchingdata-conversionname-matching

Matching records based on Person Name


Are there any tools or methods that can be used for matching by a person's name between two different data sources?

The systems have no other common information and the names have been entered differently in many cases.

Examples of non-exact matches:

King Jr., Martin Luther = King, Martin (exclude suffix)
Erving, Dr. J. = Erving, J. (exclude prefix)
Obama, Barak Hussein = Obama, Barak (exclude middle name)
Pufnstuf, H.R. = Pufnstuf, Haibane Renmei (match abbreviations)
Tankengine, Thomas = Tankengine, Tom (match common nicknames)
Flair, Rick "the Natureboy" = Flair, Natureboy (match on nickname)


Solution

  • I had to use a variety of techniques suggested. Thanks pointing me in the right direction(s). Hopefully, the following will help someone else out with this type of problem to solve.

    Removing excess characters

    CREATE FUNCTION [dbo].[fn_StripCharacters]
    (
        @String NVARCHAR(MAX), 
        @MatchExpression VARCHAR(255)
    )
    RETURNS NVARCHAR(MAX)
    AS
    BEGIN
        SET @MatchExpression =  '%['+@MatchExpression+']%'
    
        WHILE PatIndex(@MatchExpression, @String) > 0
            SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')
    
        RETURN @String
    
    END
    

    Usage:

    --remove all non-alphanumeric and non-white space  
    dbo.fn_StripCharacters(@Value, , '^a-z^0-9 ')  
    

    Split name into parts

    CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
    RETURNS @ResultList TABLE 
        (
            [ID] VARCHAR(MAX),
            [Val] VARCHAR(MAX)
        )
    AS
    BEGIN
    
    declare @OuterCursor cursor
    declare @ID varchar(max)
    declare @Val varchar(max)
    
    set @OuterCursor = cursor fast_forward for (SELECT * FROM @sList) FOR READ ONLY
    
    open @OuterCursor
    
    fetch next from @OuterCursor into @ID, @Val
    
    while (@@FETCH_STATUS=0)
    begin
    
        INSERT INTO @ResultList (ID, Val)   
        select @ID, split.s from dbo.Split(@sep, @Val) as split 
               where len(split.s) > 0
    
        fetch next from @OuterCursor into @ID, @Val
    end
    
    close @OuterCursor
    deallocate @OuterCursor 
    
    CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
    RETURNS table
    AS
    RETURN (
        WITH Pieces(pn, start, stop) AS (
          SELECT 1, 1, CHARINDEX(@sep, @s)
          UNION ALL
          SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
          FROM Pieces
          WHERE stop > 0
        )
        SELECT pn,
          LTRIM(RTRIM(SUBSTRING(@s, start, 
                 CASE WHEN stop > 0 
                      THEN stop-start 
                      ELSE 8000 
                 END))) AS s
        FROM Pieces
      )
    
    RETURN
    

    Usage:

    --create split name list
    DECLARE @NameList StringList 
    
    INSERT INTO @NameList (ID, Val)
    SELECT id, firstname FROM dbo.[User] u
    WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0 
    
    ----remove split dups
    select u.ID, COUNT(*)
    from dbo.import_SplitTable(' ', @NameList) splitList
    INNER JOIN dbo.[User] u
    ON splitList.id = u.id
    

    Common nicknames:

    I created a table based on this list and used it to join on common name equivalents.

    Usage:

    SELECT u.id
    , u.FirstName
    , u_nickname_maybe.Name AS MaybeNickname
    , u.LastName
    , c.ID AS ContactID from
    FROM dbo.[User] u 
    INNER JOIN nickname u_nickname_match
    ON u.FirstName = u_nickname_match.Name
    INNER JOIN nickname u_nickname_maybe
    ON u_nickname_match.relatedid = u_nickname_maybe.id
    LEFT OUTER JOIN
    (
        SELECT c.id, c.LastName, c.FirstName, 
             c_nickname_maybe.Name AS MaybeFirstName
        FROM dbo.Contact c
        INNER JOIN nickname c_nickname_match
        ON c.FirstName = c_nickname_match.Name
        INNER JOIN nickname c_nickname_maybe
        ON c_nickname_match.relatedid = c_nickname_maybe.id
        WHERE c_nickname_match.Name <> c_nickname_maybe.Name
    ) as c
    ON c.AccountHolderID = ah.ID 
           AND u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
    WHERE u_nickname_match.Name <> u_nickname_maybe.Name
    

    Phonetic algorithms (Jaro Winkler):

    The amazing article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms. I ended up mostly using Jaro Winkler to match the names.

    Usage:

    SELECT
    u.id AS UserID
    ,c.id AS ContactID
    ,u.FirstName
    ,c.FirstName 
    ,u.LastName
    ,c.LastName
    ,maxResult.CombinedScores
     from
    (
        SELECT
          u.ID
        , 
            max(
                dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))  
                * dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
            ) AS CombinedScores
        FROM dbo.[User] u, dbo.[Contact] c
        WHERE u.ContactID IS NULL
        GROUP BY u.id
    ) AS maxResult
    INNER JOIN dbo.[User] u
    ON maxResult.id  = u.id
    INNER JOIN dbo.[Contact] c
    ON maxResult.CombinedScores = 
    dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName)) 
    * dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))