Tags: java, regex, hadoop, apache-pig

Apache Pig - MATCHES with multiple match criteria


I am trying to take a logical match criterion like:

(("Foo" OR "Foo Bar" OR FooBar) AND ("test" OR "testA" OR "TestB")) OR TestZ

and apply it as a match against a file in Pig using

result = filter inputfields by (text matches '<some regex here>');

The problem is that I have no idea how to turn the logical expression above into a regular expression for the matches operator.

I have fiddled around with various things, and the closest I have come is something like this:

((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b)

Any ideas? I also need to do this conversion programmatically, if possible.

Some examples:

a - The quick brown Foo jumped over the lazy test (This should pass, as it contains Foo and test)

b - the was something going on in TestZ (This should also pass, as it contains TestZ)

c - the quick brown Foo jumped over the lazy dog (This should fail, as it contains Foo but not test, testA, or TestB)

Thanks


Solution

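  • Since you're using Pig, you don't actually need an involved regular expression. You can combine the boolean operators supplied by Pig with a couple of simple regular expressions. Note that matches has to match the whole string, which is why each pattern is wrapped in .* on both sides. Example: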

    T = load 'matches.txt' as (str:chararray);
    F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
    dump F;
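
If you really do want a single pattern, or need to generate it programmatically, one option is to turn each OR group into a lookahead and join the groups with alternation. Since Pig uses Java regular expressions, the idea can be tried in plain Java first. This is only a rough sketch: the class and method names (MatchSketch, anyOf, buildRegex, matchesCriteria) are made up for illustration, and it hard-codes the "(A AND B) OR C" shape from the question rather than parsing an arbitrary expression.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Pig: builds one regex for (A AND B) OR C.
final class MatchSketch {

    // Lookahead that succeeds if any of the terms occurs somewhere in the text.
    static String anyOf(List<String> terms) {
        String alternation = terms.stream()
                .map(Pattern::quote)              // treat each term as a literal
                .collect(Collectors.joining("|"));
        return "(?=.*(?:" + alternation + "))";
    }

    // Two lookaheads in a row act as AND; "|" between branches acts as OR.
    // The trailing .* lets the pattern consume the whole string.
    static String buildRegex(List<String> a, List<String> b, List<String> c) {
        return "(?:" + anyOf(a) + anyOf(b) + "|" + anyOf(c) + ").*";
    }

    static boolean matchesCriteria(String text) {
        String regex = buildRegex(
                List.of("Foo", "Foo Bar", "FooBar"),
                List.of("test", "testA", "TestB"),
                List.of("TestZ"));
        // Pattern.matches requires the entire input to match.
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesCriteria("The quick brown Foo jumped over the lazy test")); // true
        System.out.println(matchesCriteria("the was something going on in TestZ"));           // true
        System.out.println(matchesCriteria("the quick brown Foo jumped over the lazy dog"));  // false
    }
}
```

The generated pattern contains \Q...\E quoting from Pattern.quote, so check the backslash escaping before pasting it into a Pig string literal. For a fixed expression like the one in the question, the boolean-operator version above is probably easier to read and maintain.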