Tags: pattern-matching, stanford-nlp, string-matching

TokensRegex: Using AND operators


TokensRegex (a module of the Stanford CoreNLP library) supports the & (AND) operator. As I understand it, the pattern 'X & Y' should match any sequence containing both X and Y. But when I used the operator in real code, it didn't work the way I expected. Here is my Java code:

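import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;

String content = "data is here and everywhere";
String pattern = "data & is";

// Tokenize the input into CoreLabel tokens
TokenizerFactory<CoreLabel> tf = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
List<CoreLabel> tokens = tf.getTokenizer(new StringReader(content)).tokenize();
// Compile the TokensRegex pattern and run it over the tokens
TokenSequencePattern seqPattern = TokenSequencePattern.compile(pattern);
TokenSequenceMatcher matcher = seqPattern.getMatcher(tokens);

if (matcher.find()) {
    System.out.println("Matched"); // <- I expected to have this printed out
} else {
    System.out.println("Unmatched"); // <- But I've got this instead :(
}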

Would you please tell me what's wrong with my code or my understanding? Thank you in advance.


Solution

  • For the example given, matcher.find() looks for a single subsequence of the input tokens that satisfies both conditions at once:

    data: a sequence of one token with the word data

    is: a sequence of one token with the word is

    No such subsequence exists, because a one-token sequence cannot be both the word data and the word is. If you want to check whether your token sequence includes both the word data and the word is, you can try the pattern:

    String pattern = "(?: ( []* data []* ) & ( []* is []* ))";

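    The initial ?: marks the outer group as non-capturing, so no subgroup is recorded for it. [] matches any single token, so []* matches any run of zero or more tokens. With that padding on each side, both operands of & can cover the same stretch of tokens, which is what allows one subsequence to satisfy both conditions.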

    Although TokensRegex offers AND, it is not part of standard regular expression syntax. There are likely other ways (without the AND) to achieve what you want.
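
As one illustration of an approach without the AND operator, the check "does the text contain both words as tokens" can be sketched in plain Java, with no TokensRegex involved. This is only a minimal sketch under stated assumptions: the class TokenContainment and its methods are hypothetical names invented for this example, and tokens are produced by a simple whitespace split rather than by PTBTokenizer, so punctuation and contractions would be handled differently than in CoreNLP.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Stanford CoreNLP.
final class TokenContainment {

    // Assumption: whitespace splitting stands in for PTBTokenizer here.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Returns true only if every given word occurs as a whole token in the text.
    static boolean containsAllWords(String text, String... words) {
        Set<String> seen = new HashSet<>(tokenize(text));
        return seen.containsAll(Arrays.asList(words));
    }

    public static void main(String[] args) {
        String content = "data is here and everywhere";
        // Both "data" and "is" are tokens of the input.
        System.out.println(containsAllWords(content, "data", "is") ? "Matched" : "Unmatched");
        // "nowhere" is not a token of the input.
        System.out.println(containsAllWords(content, "data", "nowhere") ? "Matched" : "Unmatched");
    }
}
```

Comparing whole tokens rather than substrings means that a word such as database does not count as an occurrence of data. If the rest of the pipeline already produces CoreLabel tokens, the same set-based check could presumably be applied to their word strings instead of the whitespace split used here.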