Tags: java, regex, salesforce, apex-code, force.com

How do I use a single regular expression to strip characters from matching substrings in Java/Apex?


I'm searching for state abbreviations in a string. Here's an example input string:

String inputStr = 'Albany, NY + Chicago, IL and IN, NY, OH and WI';

The pattern that I'm using to match state abbreviations is:

String patternStr = '(^|\\W|\\G)[a-zA-Z]{2}($|\\W)';

I'm looping through the matches and stripping out the non-alpha characters during the loop, but I know that I should be able to do that in one pass. Here's the current approach:

Pattern myPattern = Pattern.compile(patternStr);
Matcher myMatcher = myPattern.matcher(inputStr);
Pattern alphasOnly = Pattern.compile('[a-zA-Z]+');
String[] states = new String[]{};
while (myMatcher.find()) {
    String rawMatch = inputStr.substring(myMatcher.start(),myMatcher.end());
    Matcher alphaMatcher = alphasOnly.matcher(rawMatch);
    while (alphaMatcher.find()) {
        states.add(rawMatch.substring(alphaMatcher.start(),alphaMatcher.end()));
    }
}

System.debug(states);
|DEBUG|(NY, IL, IN, NY, OH, WI)

This works, but it's verbose and probably inefficient. What's the one-pass way to get this done in Java/Apex?


Solution

  • Put the two letters in their own capturing group and read only that group with Matcher.group(2). Try this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class Escaping
    {
        public static void main(String[] args)
        {
            String inputStr = "Albany, NY + Chicago, IL and IN, NY, OH and WI";
            // Group 1: leading delimiter, group 2: the two letters, group 3: trailing delimiter
            String patternStr = "(^|\\W|\\G)([a-zA-Z]{2})($|\\W)";
    
            Pattern myPattern = Pattern.compile(patternStr);
            Matcher myMatcher = myPattern.matcher(inputStr);
            StringBuilder states = new StringBuilder();
            while (myMatcher.find())
            {
                // group(2) holds just the two letters, without the surrounding delimiters
                states.append(myMatcher.group(2));
                states.append(" ");
            }
    
            System.out.println(states);
        }
    }
    

    Output: NY IL IN NY OH WI

    In a real system, you'd want to verify each match against a list of valid state abbreviations; otherwise you could pick up any two-letter word, such as "to" or "of".
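
The validation step mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the class name `StateFilter`, the method `extractStates`, and the `VALID_STATES` set are hypothetical names introduced here, and the set is assumed to contain just the 50 US state codes (no DC or territories). The comparison is case-sensitive, so lowercase words such as "or" are rejected. `Set.of` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StateFilter
{
    // Assumption: only the 50 state codes count as valid; extend the set as needed.
    private static final Set<String> VALID_STATES = Set.of(
        "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
        "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
        "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
        "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
        "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY");

    // Same pattern as in the answer: group 2 captures the two letters.
    private static final Pattern TWO_LETTERS =
        Pattern.compile("(^|\\W|\\G)([a-zA-Z]{2})($|\\W)");

    public static List<String> extractStates(String input)
    {
        List<String> states = new ArrayList<>();
        Matcher matcher = TWO_LETTERS.matcher(input);
        while (matcher.find())
        {
            String candidate = matcher.group(2);
            // Keep the candidate only if it is a known state code (case-sensitive).
            if (VALID_STATES.contains(candidate))
            {
                states.add(candidate);
            }
        }
        return states;
    }

    public static void main(String[] args)
    {
        // "Go", "to" and "or" match the pattern but are filtered out by the set.
        System.out.println(extractStates("Go to Albany, NY or Chicago, IL"));
    }
}
```

A set lookup keeps the loop single-pass. The trade-off of the case-sensitive check is that lowercase input such as "ny" would be dropped unless it is upper-cased before the lookup, which in turn would let words like "or" and "in" through.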