Regex.Replace to deidentify/normalize columnar text


VB2010: I am using a regex to de-identify a block of text and normalize it at the same time. That is, for each line I want to mask the name and the confirmation code, and then normalize the text so that the data lines up in columns. I have almost all of it working except the last part: the confirmation code is preceded by a variable number of dots and a package id that is 2 to 4 characters long or may be missing.

    'regex
    Dim MyRegex As Regex = New Regex("(?<pre>^\s{0,2}(\d{1,3})\s(\d\d))(.+?)(?<post>\.\.(\d{1,3})\." + "(\w)\s((\w+?|\.\.)))(?<dots>\.+)(\w{6})", RegexOptions.IgnoreCase Or RegexOptions.Multiline)

    'this is the replacement string
    Dim replacement As String = "${pre}******/*****${post}${dots}******"

    'replace the matched text in the InputText using the replacement pattern
    Dim result As String = MyRegex.Replace(Input, replacement)

My test input with a number, name, number, misc code, package id, and confirmation code on each line:

  1 01SMITH/CH..1.A E2T......AAABBB
  2 01MTC..1.A ..............CCCDDD
  3 01GRIFFIN/JOHN..1.A E2...EEEFFF
  4 01EL/MARY..1.Z E2XT......GGGHHH
  5 02BUBBA/BILLY..2.A E2....IIIJJJ
  6 01HILL/THOR..1.A E2WW....KKKLLL

My output so far:

  1 01******/*****..1.A E2T......******
  2 01******/*****..1.A ..............******
  3 01******/*****..1.A E2...******
  4 01******/*****..1.Z E2XT......******
  5 02******/*****..2.A E2....******
  6 01******/*****..1.A E2WW....******

I am de-identifying the name and the confirmation code, but the package id before the confirmation code varies in length, and that throws off my columnar output. I am stuck on that last part but am really close. I am aiming to do it in one regex, but that may not be possible. Is it possible to pad a regex replacement?

Update with a solution:

    'regex (added one more group for the package id so I can determine its length)
    Dim MyRegex As Regex = New Regex("(?<pre>^\s{0,2}(\d{1,3})\s(\d\d))(.+?)(?<post>\.\.(\d{1,3})\.(\w)\s(?<pkid>(\w+?|\.\.)))(?<dots>\.+)(\w{6})", RegexOptions.IgnoreCase Or RegexOptions.Multiline)

    'use the MatchEvaluator to examine each match and adjust accordingly
    deid = MyRegex.Replace(deid, New MatchEvaluator(Function(m As Match)
                                                        Return m.Groups("pre").Value &
                                                            "******/*****" &
                                                            m.Groups("post").Value &
                                                            New String("."c, 5 - m.Groups("pkid").Value.Length) &
                                                            "******"
                                                    End Function))

I run that through the test data and here is what I get:

-----Input------------------------------------------------
1 01SMITH/CH..1.A E2T......AAABBB
2 01MTC..1.A ..............CCCDDD
3 01GRIFFIN/JOHN..1.A E2...EEEFFF
4 01EL/MARY..1.Z E2XT......GGGHHH
5 02BUBBA/BILLY..2.A E2....IIIJJJ
6 01HILL/THOR..1.A E2WW....KKKLLL
-----Output-----------------------------------------------
1 01******/*****..1.A E2T..******
2 01******/*****..1.A .....******
3 01******/*****..1.A E2...******
4 01******/*****..1.Z E2XT.******
5 02******/*****..2.A E2...******
6 01******/*****..1.A E2WW.******
----------------------------------------------------------
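
A variation on the evaluator above, offered as a sketch rather than a tested drop-in: padding the package id with String.PadRight instead of subtracting lengths means a package id longer than the column width cannot produce a negative repeat count. The module name `PadDemo`, the helper `Deidentify`, and the constant `PkidWidth` are made up for illustration; the pattern is the updated one with the `pkid` group.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PadDemo
    ' Hypothetical column width for the package id field (5 matches the update above).
    Private Const PkidWidth As Integer = 5

    Private ReadOnly LineRegex As New Regex(
        "(?<pre>^\s{0,2}(\d{1,3})\s(\d\d))(.+?)(?<post>\.\.(\d{1,3})\.(\w)\s(?<pkid>(\w+?|\.\.)))(?<dots>\.+)(\w{6})",
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)

    ' Hypothetical helper: masks the name and code, and pads the package id with dots.
    Function Deidentify(source As String) As String
        Return LineRegex.Replace(source,
            Function(m As Match)
                Dim post As String = m.Groups("post").Value
                Dim pkid As String = m.Groups("pkid").Value
                ' "post" ends with the package id, so split it off and pad it separately.
                Dim head As String = post.Substring(0, post.Length - pkid.Length)
                Return m.Groups("pre").Value & "******/*****" & head &
                       pkid.PadRight(PkidWidth, "."c) & "******"
            End Function)
    End Function

    Sub Main()
        Console.WriteLine(Deidentify("1 01SMITH/CH..1.A E2T......AAABBB"))
        ' prints: 1 01******/*****..1.A E2T..******
        Console.WriteLine(Deidentify("2 01MTC..1.A ..............CCCDDD"))
        ' prints: 2 01******/*****..1.A .....******
    End Sub
End Module
```

PadRight returns the string unchanged when it is already at least `PkidWidth` long, so an unexpectedly long package id would break the column alignment for that line but would not throw.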

Solution

  • There may be a better way, but you can achieve what you want with your regex and Regex.Replace by using a MatchEvaluator.

    evaluator
    Type: System.Text.RegularExpressions.MatchEvaluator
    A custom method that examines each match and returns either the original matched string or a replacement string.

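    The point is to get the lengths of Group 3 (the name) and Group 8 (the confirmation code) and repeat the * that many times. In .NET, unnamed groups are numbered first, left to right, and named groups are numbered after them, which is why the name is Group 3 and the confirmation code is Group 8 here. To add a forward slash, we find the middle by dividing the length of Group 3 by 2. StrDup is a handy function that "multiplies" the string the specified number of times.

    Here is the VB.NET code: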

    ' Assumes Imports System.Text.RegularExpressions and Option Strict Off (the VB default).
    Dim Input As String = "1 01SMITH/CH..1.A E2T......AAABBB" & Environment.NewLine &
                          "2 01MTC..1.A ..............CCCDDD" & Environment.NewLine &
                          "3 01GRIFFIN/JOHN..1.A E2...EEEFFF" & Environment.NewLine &
                          "4 01EL/MARY..1.Z E2XT......GGGHHH" & Environment.NewLine &
                          "5 02BUBBA/BILLY..2.A E2....IIIJJJ" & Environment.NewLine &
                          "6 01HILL/THOR..1.A E2WW....KKKLLL"
    Dim MyRegex As Regex = New Regex("(?<pre>^\s{0,2}(\d{1,3})\s(\d\d))(.+?)(?<post>\.\.(\d{1,3})\." + "(\w)\s((\w+?|\.\.)))(?<dots>\.+)(\w{6})", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    Dim result As String = MyRegex.Replace(Input, New MatchEvaluator(Function(m As Match)
                                        ' Group 3 is the name; Group 8 is the confirmation code.
                                        ' Length / 2 is a Double; its implicit conversion to Integer rounds half to even.
                                        Return m.Groups("pre").Value &
                                        StrDup(m.Groups(3).Value.Length, "*").Insert(m.Groups(3).Value.Length / 2, "/") &
                                        m.Groups("post").Value &
                                        m.Groups("dots").Value &
                                        StrDup(m.Groups(8).Value.Length, "*")
                                  End Function))
    Console.WriteLine(result)
    

    Result:

    1 01****/****..1.A E2T......******
    2 01**/*..1.A ..............******
    3 01******/******..1.A E2...******
    4 01****/***..1.Z E2XT......******
    5 02******/*****..2.A E2....******
    6 01****/*****..1.A E2WW....******
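
The group numbers used above can be checked with a short sketch. It relies on the .NET rule that unnamed groups are numbered before named ones (as long as RegexOptions.ExplicitCapture is not set); the module name `GroupNumberDemo` is made up for illustration, and the pattern is the one from the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupNumberDemo
    Sub Main()
        Dim rx As New Regex(
            "(?<pre>^\s{0,2}(\d{1,3})\s(\d\d))(.+?)(?<post>\.\.(\d{1,3})\.(\w)\s((\w+?|\.\.)))(?<dots>\.+)(\w{6})",
            RegexOptions.IgnoreCase Or RegexOptions.Multiline)

        ' List every group name with its number; the named groups
        ' (pre, post, dots) come after the unnamed groups 1 to 8.
        For Each name As String In rx.GetGroupNames()
            Console.WriteLine("{0} -> {1}", name, rx.GroupNumberFromName(name))
        Next

        Dim m As Match = rx.Match("1 01SMITH/CH..1.A E2T......AAABBB")
        Console.WriteLine(m.Groups(3).Value)   ' prints: SMITH/CH (the name)
        Console.WriteLine(m.Groups(8).Value)   ' prints: AAABBB (the confirmation code)
    End Sub
End Module
```

Giving the name and the confirmation code their own named groups would avoid depending on this numbering at all, at the cost of a slightly longer pattern.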