
Searching a Java 8 string for the existence of words from a list


Java 8 here. I am given a list of blacklisted words/expressions as well as an input string. I need to determine if any of those blacklisted items appears in the input string:

List<String> blacklist = new ArrayList<>();

// populate the blacklist and "normalize" it by removing whitespace and converting to lower case
blacklist.add("Call for info".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Travel".toLowerCase().replaceAll("\\s", ""));
blacklist.add("To be determined".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Meals".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Custom Call".toLowerCase().replaceAll("\\s", ""));
blacklist.add("Custom".toLowerCase().replaceAll("\\s", ""));

// obtain the input string and also "normalize" it
String input = getSomehow().toLowerCase().replaceAll("\\s", "");

// now determine if any blacklisted words/expressions appear inside the input
for(String blItem : blacklist) {
    if (input.contains(blItem)) {
        throw new RuntimeException("IMPOSSSSSSSIBLE!");
    }
}

I thought this was working great until my input string contained the word "Customer" inside of it.

Since custom exists inside customer, the program is throwing an exception. Instead, I want it to be allowed, because "customer" is not a blacklisted word.

So I think the actual logic here is:

  • If the input string contains a blacklist word...
  • ...AND the blacklist word is preceded by either the beginning of the string or a non-alphabetical character (anything outside [a-z])...
  • ...AND the blacklist word is followed by either the end of the string or a non-alphabetical character...
  • ...then throw the exception

I think that would cover all my bases.

Does Java 8 or any (Apache or otherwise) "commons" library have anything that will help me here? For some reason I'm having a hard time wrapping my head around this and making the code look elegant (I'm not sure how to check for the beginning/ending of a string from inside a regex, etc.).

Any ideas?


Solution

  • You can pre-compile a list of Patterns for the given words.

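    \b matches a word boundary. Putting \b on both sides of a word makes the pattern match only the whole word, so custom no longer matches inside customer. Note that this relies on the input still containing its whitespace: if all whitespace is stripped, as in the question, adjacent words run together and the boundaries between them disappear.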

    List<Pattern> blackListPatterns =
        blacklist.stream()
            .map(word -> Pattern.compile("\\b" + Pattern.quote(word) + "\\b"))
            .collect(Collectors.toList());
    

    Then you can test the input against each Pattern in the list.

    If you are sure your words contain no regex metacharacters such as ( or *, you can build the Pattern directly from the String and skip Pattern.quote(), which is only there to escape metacharacters.

    for (Pattern pattern : blackListPatterns) {
        if (pattern.matcher(input).find()) {
            throw new RuntimeException("IMPOSSSSSSSIBLE!");
        }
    }
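
Putting the pieces together, here is a minimal, self-contained sketch of the approach above. The class name BlacklistChecker and its helper methods are hypothetical, not part of any library. It also assumes a different normalization from the question's: whitespace is collapsed to single spaces rather than removed, because \b needs the spaces to tell words apart. Locale.ROOT is used for lower-casing so the result does not depend on the default locale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class BlacklistChecker {

    // Lower-case and collapse runs of whitespace to a single space.
    // Assumption: spaces are kept (not removed) so that \b can see word boundaries.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    // Pre-compile one whole-word pattern per blacklisted phrase.
    static List<Pattern> compile(List<String> phrases) {
        return phrases.stream()
                .map(BlacklistChecker::normalize)
                .map(p -> Pattern.compile("\\b" + Pattern.quote(p) + "\\b"))
                .collect(Collectors.toList());
    }

    // True if any blacklisted phrase occurs in the input as whole word(s).
    static boolean containsBlacklisted(String input, List<Pattern> patterns) {
        String normalized = normalize(input);
        return patterns.stream().anyMatch(p -> p.matcher(normalized).find());
    }

    public static void main(String[] args) {
        List<Pattern> patterns = compile(Arrays.asList(
                "Call for info", "Travel", "To be determined",
                "Meals", "Custom Call", "Custom"));

        System.out.println(containsBlacklisted("Dear Customer, welcome", patterns)); // false
        System.out.println(containsBlacklisted("A CUSTOM   call today", patterns));  // true
    }
}
```

Returning a boolean instead of throwing inside the loop leaves it to the caller to decide how to react to a match; throwing the exception from the call site would reproduce the original behaviour.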