Java: Parse Australian Street Addresses


Looking for a quick and dirty way to parse Australian street addresses into their parts:
3A/45 Jindabyne Rd, Oakleigh, VIC 3166

should split into:
"3A", 45, "Jindabyne Rd", "Oakleigh", "VIC", 3166

Suburb names can have multiple words, as can street names.


See: Parse A Steet Address into components

Has to be in Java, and cannot make HTTP requests (e.g. to web APIs).


EDIT: Assume that the format specified above is always followed. I have no issue with spitting incorrectly formatted strings back at the user with a message telling them to follow that format.


Solution

  • Given your reply to my other answer, this should do for the strictly-formatted case you specify:

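        // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
        String sample = "3A/45 Jindabyne Rd, Oakleigh, VIC 3166";
        Pattern pattern = Pattern.compile("(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");
        Matcher m = pattern.matcher(sample);
        // matches() requires the whole string to fit the format; find() would also
        // accept input with leading or trailing junk around a valid address.
        if (m.matches()) {
            System.out.println("Unit: " + m.group(2));      // null when there is no unit
            System.out.println("Number: " + m.group(3));
            System.out.println("Street: " + m.group(4));
            System.out.println("Suburb: " + m.group(5));
            System.out.println("State: " + m.group(6));
            System.out.println("Postcode: " + m.group(7));
        } else {
            throw new IllegalArgumentException("Address not in expected format: " + sample);
        }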
    

    This still works if you remove the '3A/' (in which case m.group(2) will be null), if the street number is '45A' or '45-47', or if you add a word to the street name ('Jindabyne East Rd') or to the suburb ('Oakleigh South').
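
    The variants listed above can be checked with a small sketch along these lines. The class name AddressVariantsDemo and the helper describe are made up for illustration; the pattern is the one from this answer, and the sketch assumes matches() is used so that the whole string has to fit the format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer.
class AddressVariantsDemo {
    static final Pattern ADDRESS = Pattern.compile(
            "(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");

    // Returns the six captured parts joined with '|', or null if the input
    // does not fit the format. A missing unit shows up as the text "null".
    static String describe(String input) {
        Matcher m = ADDRESS.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group(2) + "|" + m.group(3) + "|" + m.group(4) + "|"
                + m.group(5) + "|" + m.group(6) + "|" + m.group(7);
    }

    public static void main(String[] args) {
        String[] samples = {
            "3A/45 Jindabyne Rd, Oakleigh, VIC 3166",
            "45 Jindabyne Rd, Oakleigh, VIC 3166",                  // no unit
            "3A/45A Jindabyne Rd, Oakleigh, VIC 3166",              // lettered number
            "3A/45-47 Jindabyne East Rd, Oakleigh South, VIC 3166"  // range, extra words
        };
        for (String s : samples) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```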

    Just to explain that regex further, if you're not familiar with regular expressions:

    (([^/ ]+)/)? is the equivalent of just ([^/ ]+/)? -- that is, 'anything not including a forward slash or a space, followed by a slash'. The question mark makes it optional (so the whole clause can be missing), and the extra parentheses in the final version are to create a smaller inner group, without the slash, for later extraction.
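
    To see the difference between the outer and inner group, here is a minimal sketch that applies only that first clause to the start of a string. UnitGroupDemo and unitGroups are illustrative names, not part of the original code; lookingAt() is used because the clause is only meant to match a prefix of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that isolates the optional unit clause of the pattern.
class UnitGroupDemo {
    private static final Pattern UNIT_CLAUSE = Pattern.compile("(([^/ ]+)/)?");

    // Returns { outer group (with slash), inner group (without slash) }.
    // Both entries are null when the input has no unit prefix.
    static String[] unitGroups(String input) {
        Matcher m = UNIT_CLAUSE.matcher(input);
        // The clause is optional, so lookingAt() always succeeds, possibly
        // with an empty match in which neither group participated.
        m.lookingAt();
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] withUnit = unitGroups("3A/45 Jindabyne Rd");
        System.out.println(withUnit[0] + " / " + withUnit[1]);
        String[] withoutUnit = unitGroups("45 Jindabyne Rd");
        System.out.println(withoutUnit[0] + " / " + withoutUnit[1]);
    }
}
```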

    ([^ ]+) is 'capture anything that's not a space (which is followed by a space)' -- this is the street number.

    ([^,]+), is 'capture anything that's not a comma (which is followed by comma and space)' -- this is the street name. Anything is valid in the street name as long as it's not a comma.

    ([^,]+), is the same again, in this case to capture the suburb.

    ([^ ]+) captures the next non-space string (the state abbreviation) and skips the space after it.

    (\\d+) rounds off by capturing one or more digits (the postcode).
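
    If the numbered groups get hard to keep track of, one possible variation is to use named groups and wrap the result in a small value class. This is only a sketch: the class AustralianAddress, its field names and the error message are my own assumptions, and it uses a non-capturing (?:...) wrapper around the unit so that only the parts you care about are captured. Named groups need Java 7 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class; the regex is the same shape as above, with names.
final class AustralianAddress {
    private static final Pattern FORMAT = Pattern.compile(
            "(?:(?<unit>[^/ ]+)/)?(?<number>[^ ]+) (?<street>[^,]+), "
            + "(?<suburb>[^,]+), (?<state>[^ ]+) (?<postcode>\\d+)");

    final String unit;      // null when the address has no unit
    final String number;    // kept as text so "45A" and "45-47" survive
    final String street;
    final String suburb;
    final String state;
    final String postcode;  // kept as text to preserve any leading zero

    private AustralianAddress(Matcher m) {
        unit = m.group("unit");
        number = m.group("number");
        street = m.group("street");
        suburb = m.group("suburb");
        state = m.group("state");
        postcode = m.group("postcode");
    }

    // Parses a strictly formatted address, rejecting anything else.
    static AustralianAddress parse(String input) {
        Matcher m = FORMAT.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Expected '[unit/]number street, suburb, STATE postcode' but got: " + input);
        }
        return new AustralianAddress(m);
    }

    public static void main(String[] args) {
        AustralianAddress a = parse("3A/45 Jindabyne Rd, Oakleigh, VIC 3166");
        System.out.println(a.unit + ", " + a.number + ", " + a.street + ", "
                + a.suburb + ", " + a.state + ", " + a.postcode);
    }
}
```

    Keeping the number and postcode as strings is a deliberate choice here; convert them with Integer.parseInt only if you are sure you never get values like '45A'.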

    Hope that's helpful.