C# Regex Parsing Or Names


I am pulling data from a printable PDF using iTextSharp. This is the text that I have extracted:

Borrower: Guarantor:
{{0_SH}} By: {{1_SH}} (seal)
By: (seal)
Print Name:
Print Name:
Phillip Moore Phillip Moore
Date: {{1_DH}}
2/23/2022
Title: Owner
Date: {{0_DH}}
2/23/2022
12 of 12 (LOC 2020) Borrower Initials {{0_IH}}

And I have written this regex routine:

string pattern = @"Print\sName:\s(?'guarantor1'[a-zA-Z|\s|-|-|'|,|.|&|\d]+)\n";
Regex rgx = new Regex(pattern, RegexOptions.Singleline);
MatchCollection matches = rgx.Matches(fullText);
if (matches.Count > 0)
{
    string guarantor1 = matches[0].Groups["guarantor1"].Value;
    return guarantor1.Trim();
}

But the extracted data from the regex for guarantor1 is Phillip Moore Phillip Moore. I need just the first part Phillip Moore. Any ideas how to parse this correctly? There could also be a middle name or initial.


Solution

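  • You could match the last occurrence of Print Name: (the negative lookahead (?!Print\sName) skips an occurrence that is directly followed by another one), then match as few of the allowed characters as possible until a lookahead finds the same text again, via a backreference, just before the end of the line.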

    Note that \s can also match a newline.

    \bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1$)
    

    See a regex demo and a C# demo.

    If the pattern should also match when the name appears only once, make the whitespace and the backreference to group 1 optional.

    \bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?$)
    

    See another Regex demo.

    Example code

    string pattern = @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1\r?$)";
    Regex rgx = new Regex(pattern, RegexOptions.Multiline);
    MatchCollection matches = rgx.Matches(fullText);
    if (matches.Count > 0)
    {
        string guarantor1 = matches[0].Groups["guarantor1"].Value;
        Console.WriteLine(guarantor1.Trim());
    }
    

    Output

    Phillip Moore
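
The example code above only uses the first pattern. As a minimal sketch, the optional-backreference variant could be wrapped in a helper like the one below. The class name GuarantorParser and the method name ExtractName are made up for illustration, and the \r? parts assume the extracted text may contain Windows line endings.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (names are illustrative, not from the original answer).
public static class GuarantorParser
{
    // Optional-backreference variant of the pattern from the answer.
    // Assumption: \r? is added so that \r\n line endings are tolerated.
    // Multiline makes $ match at the end of each line, not only at the end of the text.
    private static readonly Regex NamePattern = new Regex(
        @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?\r?$)",
        RegexOptions.Multiline);

    // Returns the trimmed name, or null when no "Print Name:" block is found.
    public static string ExtractName(string fullText)
    {
        Match m = NamePattern.Match(fullText);
        return m.Success ? m.Groups["guarantor1"].Value.Trim() : null;
    }
}
```

With the text from the question, GuarantorParser.ExtractName(fullText) should return Phillip Moore. A line that holds the name only once, such as Phillip A. Moore, should be returned unchanged, because the lazy quantifier then runs to the end of the line.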