java regex matching

Java regex: exclude one line in an address block


This is my regex pattern:

\b(?!(?i)Doctor| )([^\s]*\b)[\n\r\s]+(\b[\S\s][^\d]*\b)[\s\S]+([0-9]{5})\s+([\D]*)

I need to retrieve the different pieces of information from an address block, e.g.:

Doctor John DOE
123 dream road 
12345 TOWN

Java Code:

firstName = matcher.group(1).trim();
lastName = matcher.group(2).trim();

It works fine! But sometimes there is an additional line:

Doctor John DOE
Country Hospital
123 dream road 
12345 TOWN

The first name is retrieved correctly ("John"), but the last name retrieved is "DOE Country Hospital".

Ideally, I would like to get the first name, last name, address line 1 (if present), address line 2, code and town in separate fields.

But I can't find the correct pattern.



Solution

  • Avoid [\s\S] (and negated classes such as [^\d]) when you don't want a match to run across newlines; use . instead. Your last-name group captures DOE Country Hospital because there is a newline after DOE, and both [\s\S] and [^\d] match that newline. The dot does not match line terminators unless you enable the (?s) modifier (DOTALL), which makes . match newlines too.
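
    To see the difference in isolation, here is a small sketch. The NewlineMatchingDemo class and the firstMatch helper are made-up names for illustration, and the input is a shortened piece of your second example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewlineMatchingDemo {

    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "DOE\nCountry Hospital\n123 dream road";

        // '.' stops at the line terminator by default: matches only "DOE".
        System.out.println(firstMatch("D.*", 0, input));
        // [\s\S] matches any character, so the match runs through the newlines.
        System.out.println(firstMatch("D[\\s\\S]*", 0, input));
        // [^\d] also matches newlines; it stops only at the first digit.
        System.out.println(firstMatch("D[^\\d]*", 0, input));
        // With DOTALL (same effect as the inline (?s) flag), '.' matches newlines too.
        System.out.println(firstMatch("D.*", Pattern.DOTALL, input));
    }
}
```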

    Based on the two examples in your post, where the first address line is optional, you can use the following regex to capture the different parts of the address. I have simplified your regex and used \R to match line breaks: \R (Java 8+) matches \n, \r\n, \r and several Unicode line separators, so it works whichever operating system produced the text.

    ^(?:(?i)Doctor )?(?<FirstName>\S+)\s+(?<LastName>\S+)(?:\R(?<AddressLine1>.+))?\R(?:(?<AddressLine2>.+))\R(?<Code>\d{5})\s+(?<Town>.+)$
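
    As a quick illustration of what \R buys you, the sketch below splits a string that mixes Unix, Windows and old Mac line endings. LinebreakDemo and splitLines are hypothetical names used only for this example.

```java
class LinebreakDemo {

    // Splits text into lines on any linebreak sequence recognized by \R.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // Windows (\r\n), Unix (\n) and old Mac (\r) line endings in one string.
        String mixed = "Doctor John DOE\r\n123 dream road\n12345 TOWN\rend";
        for (String line : splitLines(mixed)) {
            System.out.println("[" + line + "]");
        }
    }
}
```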
    


    Note that I have used named groups to make it easy to see which part is which. You can replace them with plain capturing groups to make the regex shorter. Let me know if this works for you and I will add an explanation if you need one.
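
    If you prefer the shorter form, a version with plain numbered groups might look like the sketch below. The NumberedGroupsDemo class and the parse helper are names I made up; the group order is the same as in the named version.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberedGroupsDemo {

    // Same structure as the named-group pattern, with plain capturing groups:
    // 1 = first name, 2 = last name, 3 = optional address line 1,
    // 4 = address line 2, 5 = code, 6 = town.
    static final Pattern ADDRESS = Pattern.compile(
            "^(?:(?i)Doctor )?(\\S+)\\s+(\\S+)(?:\\R(.+))?\\R(.+)\\R(\\d{5})\\s+(.+)$",
            Pattern.MULTILINE);

    // Returns the six groups of the first address block found, or null if none matches.
    static String[] parse(String block) {
        Matcher m = ADDRESS.matcher(block);
        if (!m.find()) {
            return null;
        }
        String[] parts = new String[6];
        for (int i = 0; i < 6; i++) {
            parts[i] = m.group(i + 1); // group 3 is null when the optional line is absent
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = parse("Doctor John DOE\nCountry Hospital\n123 dream road \n12345 TOWN");
        System.out.println(String.join(" | ", parts));
    }
}
```

    The trade-off is readability: with numbered groups you have to count parentheses whenever the pattern changes.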

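    Edit: working Java code. Pattern.MULTILINE lets ^ and $ match at every line boundary, so several address blocks can be found in one string: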

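    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name is arbitrary; any name works.
    public class AddressRegexDemo {

        public static void main(String[] args) {
            // Optional "Doctor " prefix, first and last name, optional extra line,
            // street line, then a 5-digit code followed by the town.
            final String regex = "^(?:(?i)Doctor )?(?<FirstName>\\S+)\\s+(?<LastName>\\S+)(?:\\R(?<AddressLine1>.+))?\\R(?:(?<AddressLine2>.+))\\R(?<Code>\\d{5})\\s+(?<Town>.+)$";
            // Two sample blocks: one without and one with the optional line.
            final String string = "Doctor John DOE\n"
                    + "123 dream road \n"
                    + "12345 TOWN\n\n\n"
                    + "Doctor John DOE\n"
                    + "Country Hospital\n"
                    + "123 dream road \n"
                    + "12345 TOWN";

            // MULTILINE makes ^ and $ match at the start and end of every line.
            final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
            final Matcher matcher = pattern.matcher(string);

            // Each find() call locates the next address block in the input.
            while (matcher.find()) {
                System.out.println("Full match: " + matcher.group() + "\n");
                System.out.println("Group FirstName: " + matcher.group("FirstName"));
                System.out.println("Group LastName: " + matcher.group("LastName"));
                // AddressLine1 is null when the optional line is absent.
                System.out.println("Group AddressLine1: " + matcher.group("AddressLine1"));
                System.out.println("Group AddressLine2: " + matcher.group("AddressLine2"));
                System.out.println("Group Code: " + matcher.group("Code"));
                System.out.println("Group Town: " + matcher.group("Town"));
                System.out.println("\n\n");
            }
        }
    }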
    

    This produces the following output:

    Full match: Doctor John DOE
    123 dream road 
    12345 TOWN
    
    Group FirstName: John
    Group LastName: DOE
    Group AddressLine1: null
    Group AddressLine2: 123 dream road 
    Group Code: 12345
    Group Town: TOWN
    
    
    
    Full match: Doctor John DOE
    Country Hospital
    123 dream road 
    12345 TOWN
    
    Group FirstName: John
    Group LastName: DOE
    Group AddressLine1: Country Hospital
    Group AddressLine2: 123 dream road 
    Group Code: 12345
    Group Town: TOWN
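
    If you want the fields as objects rather than printed lines, something like the following could work. It is only a sketch under a few assumptions: AddressParser, Address and parseAll are made-up names, the redundant non-capturing group around AddressLine2 is dropped, and values are trimmed as in your original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressParser {

    static final Pattern ADDRESS = Pattern.compile(
            "^(?:(?i)Doctor )?(?<FirstName>\\S+)\\s+(?<LastName>\\S+)"
            + "(?:\\R(?<AddressLine1>.+))?\\R(?<AddressLine2>.+)"
            + "\\R(?<Code>\\d{5})\\s+(?<Town>.+)$",
            Pattern.MULTILINE);

    // Hypothetical value holder; field names mirror the named groups.
    static final class Address {
        final String firstName;
        final String lastName;
        final String addressLine1; // null when the block has no extra line
        final String addressLine2;
        final String code;
        final String town;

        Address(Matcher m) {
            firstName = m.group("FirstName");
            lastName = m.group("LastName");
            addressLine1 = trimOrNull(m.group("AddressLine1"));
            addressLine2 = m.group("AddressLine2").trim();
            code = m.group("Code");
            town = m.group("Town").trim();
        }
    }

    static String trimOrNull(String s) {
        return s == null ? null : s.trim();
    }

    // Collects every address block found in the text, in order of appearance.
    static List<Address> parseAll(String text) {
        List<Address> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(text);
        while (m.find()) {
            result.add(new Address(m));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Doctor John DOE\n123 dream road \n12345 TOWN\n\n\n"
                + "Doctor John DOE\nCountry Hospital\n123 dream road \n12345 TOWN";
        for (Address a : parseAll(text)) {
            System.out.println(a.firstName + " " + a.lastName + ", line 1: " + a.addressLine1
                    + ", line 2: " + a.addressLine2 + ", " + a.code + " " + a.town);
        }
    }
}
```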