I am pulling data from a printable PDF using iTextSharp. This is the text that I have extracted:
Borrower: Guarantor:
{{0_SH}} By: {{1_SH}} (seal)
By: (seal)
Print Name:
Print Name:
Phillip Moore Phillip Moore
Date: {{1_DH}}
2/23/2022
Title: Owner
Date: {{0_DH}}
2/23/2022
12 of 12 (LOC 2020) Borrower Initials {{0_IH}}
And I have written this regex routine:
string pattern = @"Print\sName:\s(?'guarantor1'[a-zA-Z|\s|-|-|'|,|.|&|\d]+)\n";
Regex rgx = new Regex(pattern, RegexOptions.Singleline);
MatchCollection matches = rgx.Matches(fullText);
if (matches.Count > 0)
{
    string guarantor1 = matches[0].Groups["guarantor1"].Value;
    return guarantor1.Trim();
}
But the extracted data from the regex for guarantor1 is Phillip Moore Phillip Moore. I need just the first part Phillip Moore. Any ideas how to parse this correctly? There could also be a middle name or initial.
You could match the last occurrence of Print Name: and then match as few of the allowed characters as possible, until a lookahead with a backreference finds the same text repeated up to the end of the line (this needs the Multiline option, so that $ matches at the end of each line).
Note that \s can also match a newline.
\bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1$)
See a regex demo and a C# demo.
If the name can also appear only once, the whitespace and the backreference to group 1 can be made optional.
\bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?$)
See another Regex demo.
Example code
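using System;
using System.Text.RegularExpressions;

// fullText holds the text extracted from the PDF.
// Multiline makes $ match at the end of each line; \r? tolerates Windows line endings.
string pattern = @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1\r?$)";
Regex rgx = new Regex(pattern, RegexOptions.Multiline);
MatchCollection matches = rgx.Matches(fullText);
if (matches.Count > 0)
{
    // The lazy quantifier stops at the first copy of the name; \1 asserts that the second copy follows.
    string guarantor1 = matches[0].Groups["guarantor1"].Value;
    Console.WriteLine(guarantor1.Trim());
}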
Output
Phillip Moore
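The optional variant can be tried in the same way. The following is a minimal sketch, not a definitive implementation: the class name GuarantorNameDemo, the helper ExtractGuarantor, and both sample strings (including the middle initial) are made up for illustration, and the \r? before $ is an added assumption to tolerate Windows line endings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuarantorNameDemo
{
    // Optional-backreference variant of the pattern from the answer.
    // Assumption: \r? is added before $ so that \r\n line endings also work.
    private static readonly Regex NameRegex = new Regex(
        @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?\r?$)",
        RegexOptions.Multiline);

    // Hypothetical helper: returns the first captured name, or null when nothing matches.
    public static string ExtractGuarantor(string fullText)
    {
        Match match = NameRegex.Match(fullText);
        return match.Success ? match.Groups["guarantor1"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Name printed twice on one line, as in the extracted PDF text.
        string doubled = "Print Name:\nPrint Name:\nPhillip Moore Phillip Moore\nDate: {{1_DH}}\n";
        // Made-up case: name printed once, with a middle initial.
        string single = "Print Name:\nPhillip J. Moore\nDate: {{1_DH}}\n";

        Console.WriteLine(ExtractGuarantor(doubled)); // Phillip Moore
        Console.WriteLine(ExtractGuarantor(single));  // Phillip J. Moore
    }
}
```

Because the quantifier is lazy, the group grows one character at a time and stops at the first point where either the same text repeats up to the end of the line or the line simply ends, so both layouts give a single copy of the name.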