This is my regex pattern:
\b(?!(?i)Doctor| )([^\s]*\b)[\n\r\s]+(\b[\S\s][^\d]*\b)[\s\S]+([0-9]{5})\s+([\D]*)
I need to retrieve the different pieces of information from an address block, for example:
Doctor John DOE
123 dream road
12345 TOWN
Java Code:
firstName = matcher.group(1).trim();
lastName = matcher.group(2).trim();
It works fine! But sometimes there is an additional line:
Doctor John DOE
Country Hospital
123 dream road
12345 TOWN
The first name is retrieved correctly ("John"), but the last name retrieved is "DOE Country Hospital".
Ideally, I would like to get the first name, last name, address line 1 (if present), address line 2, code and town in separate fields,
but I can't find the correct pattern...
Avoid using [\s\S] and use . instead if you don't want to match newlines too. [\s\S] matches any character, including the newline after DOE, which is why DOE Country Hospital ends up in a single group; the negated class [^\d] in your second group behaves the same way, since a newline is not a digit. By contrast, . does not match a newline unless you enable the (?s) modifier (DOTALL), which makes . match newlines too.
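To make the difference concrete, here is a minimal sketch (the class name DotVsAnyChar and the helper capture are made up for this illustration, and the sample text is shortened from the question). It only assumes the standard java.util.regex behaviour described above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotVsAnyChar {
    // Hypothetical helper: returns what group 1 captured, or null if nothing matched.
    static String capture(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "DOE\nCountry Hospital\n123 dream road";

        // [\s\S] matches every character, so the capture runs across both newlines.
        System.out.println(capture("([\\s\\S]+?)\\d", 0, input).replace("\n", "\\n"));

        // A negated class such as [^\d] also crosses newlines.
        System.out.println(capture("([^\\d]+)", 0, input).replace("\n", "\\n"));

        // By default . stops at the first line terminator, so only "DOE" is captured.
        System.out.println(capture("(.+)", 0, input));

        // With DOTALL (the (?s) modifier), . crosses newlines again.
        System.out.println(capture("(.+?)\\d", Pattern.DOTALL, input).replace("\n", "\\n"));
    }
}
```

Only the third call stays on the first line; the other three all capture "DOE", the newline, "Country Hospital" and the following newline.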
With the two examples in your post, where the first address line is optional, you can use the following regex to capture the different parts of the address. I have simplified your regex and used \R to match a newline, since \R (available since Java 8) matches almost all variants of line breaks used by the various operating systems.
^(?:(?i)Doctor )?(?<FirstName>\S+)\s+(?<LastName>\S+)(?:\R(?<AddressLine1>.+))?\R(?:(?<AddressLine2>.+))\R(?<Code>\d{5})\s+(?<Town>.+)$
Note that I have used named groups to make it easy to see which part is which. You can remove them to make the regex shorter. Let me know if this works for you and I will add an explanation if you need one.
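As a quick check of the \R choice, the sketch below runs the same regex against one address block written with Unix (\n), Windows (\r\n) and old Mac (\r) line endings. The class name LineBreakDemo and the helper codeAndTown are invented for this example; the regex itself is the one given above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBreakDemo {
    private static final Pattern ADDRESS = Pattern.compile(
            "^(?:(?i)Doctor )?(?<FirstName>\\S+)\\s+(?<LastName>\\S+)"
            + "(?:\\R(?<AddressLine1>.+))?\\R(?:(?<AddressLine2>.+))"
            + "\\R(?<Code>\\d{5})\\s+(?<Town>.+)$",
            Pattern.MULTILINE);

    // Hypothetical helper: returns "Code Town" for the first address found, or null if none matches.
    static String codeAndTown(String block) {
        Matcher m = ADDRESS.matcher(block);
        return m.find() ? m.group("Code") + " " + m.group("Town") : null;
    }

    public static void main(String[] args) {
        // The same three-line address, joined with different line terminators.
        for (String eol : new String[] {"\n", "\r\n", "\r"}) {
            String block = String.join(eol, "Doctor John DOE", "123 dream road", "12345 TOWN");
            System.out.println(codeAndTown(block)); // prints "12345 TOWN" each time
        }
    }
}
```

With a literal \n in place of \R, the Windows and old Mac variants would not match as written, which is the main reason to prefer \R here.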
Edit: here is working Java code,
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressRegexDemo {
    public static void main(String[] args) {
        // AddressLine1 is optional; \R matches any kind of line break.
        final String regex = "^(?:(?i)Doctor )?(?<FirstName>\\S+)\\s+(?<LastName>\\S+)(?:\\R(?<AddressLine1>.+))?\\R(?:(?<AddressLine2>.+))\\R(?<Code>\\d{5})\\s+(?<Town>.+)$";
        // Two sample blocks: one without and one with the extra address line.
        final String string = "Doctor John DOE\n"
                + "123 dream road \n"
                + "12345 TOWN\n\n\n"
                + "Doctor John DOE\n"
                + "Country Hospital\n"
                + "123 dream road \n"
                + "12345 TOWN";
        // MULTILINE lets ^ and $ match at the start and end of each line.
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        // Each find() call locates the next address block in the input.
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group() + "\n");
            System.out.println("Group FirstName: " + matcher.group("FirstName"));
            System.out.println("Group LastName: " + matcher.group("LastName"));
            // AddressLine1 is null when the optional line is absent.
            System.out.println("Group AddressLine1: " + matcher.group("AddressLine1"));
            System.out.println("Group AddressLine2: " + matcher.group("AddressLine2"));
            System.out.println("Group Code: " + matcher.group("Code"));
            System.out.println("Group Town: " + matcher.group("Town"));
            System.out.println("\n\n");
        }
    }
}
This produces the following output:
Full match: Doctor John DOE
123 dream road
12345 TOWN
Group FirstName: John
Group LastName: DOE
Group AddressLine1: null
Group AddressLine2: 123 dream road
Group Code: 12345
Group Town: TOWN
Full match: Doctor John DOE
Country Hospital
123 dream road
12345 TOWN
Group FirstName: John
Group LastName: DOE
Group AddressLine1: Country Hospital
Group AddressLine2: 123 dream road
Group Code: 12345
Group Town: TOWN
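Since AddressLine1 comes back as null when the optional line is missing, calling .trim() on it directly (as in the question's code) would throw a NullPointerException. One possible way to handle that, sketched here with a hypothetical AddressParser class and parse method that are not part of the code above, is to collect the groups into a map and replace a missing group with an empty string:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressParser {
    private static final String[] FIELDS =
            {"FirstName", "LastName", "AddressLine1", "AddressLine2", "Code", "Town"};

    private static final Pattern ADDRESS = Pattern.compile(
            "^(?:(?i)Doctor )?(?<FirstName>\\S+)\\s+(?<LastName>\\S+)"
            + "(?:\\R(?<AddressLine1>.+))?\\R(?:(?<AddressLine2>.+))"
            + "\\R(?<Code>\\d{5})\\s+(?<Town>.+)$",
            Pattern.MULTILINE);

    // Returns the trimmed fields of the first address block, or an empty Optional if none matches.
    static Optional<Map<String, String>> parse(String block) {
        Matcher m = ADDRESS.matcher(block);
        if (!m.find()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : FIELDS) {
            String value = m.group(name);
            // A group that did not participate in the match is null; store "" instead.
            fields.put(name, value == null ? "" : value.trim());
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        System.out.println(parse("Doctor John DOE\n123 dream road \n12345 TOWN"));
        System.out.println(parse("Doctor John DOE\nCountry Hospital\n123 dream road \n12345 TOWN"));
    }
}
```

Using an empty string for the missing line is just one choice; keeping null or wrapping the value in its own Optional would work equally well, depending on how the fields are consumed afterwards.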