If "Who acted as (?P<role>.*) in (?P<movie>.*)" is the template, I want to match queries like "Who acted as tony montana in Scarface".
If the role name or the movie name contains the word "in", the regex match goes wrong.
E.g. "Who acted as k in men in black" gives "k in men" as the role.
Maybe a non-greedy approach will work for this query, but it will go for a toss when the role itself contains the word "in". How do I get all possible interpretations here?
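To make the problem reproducible, here is a minimal sketch comparing the greedy template with a non-greedy variant. The names `GREEDY`, `LAZY`, and `parse` are made up for this illustration, and the second query is only there to show a role that contains "in".

```python
import re

# Greedy template from the question, and a non-greedy variant of the role group.
GREEDY = re.compile(r"Who acted as (?P<role>.*) in (?P<movie>.*)")
LAZY = re.compile(r"Who acted as (?P<role>.*?) in (?P<movie>.*)")


def parse(pattern, query):
    """Return (role, movie) for the query, or None if the template does not match."""
    m = pattern.match(query)
    return (m.group("role"), m.group("movie")) if m else None


q1 = "Who acted as k in men in black"
q2 = "Who acted as woman in red in matrix"

print(parse(GREEDY, q1))  # ('k in men', 'black')
print(parse(LAZY, q1))    # ('k', 'men in black')
print(parse(GREEDY, q2))  # ('woman in red', 'matrix')
print(parse(LAZY, q2))    # ('woman', 'red in matrix')
```

Each pattern always splits at one fixed "in" (the last one for greedy, the first one for non-greedy), so neither can return every interpretation on its own.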
Given a phrase like 'a in b in c in d', this will generate all possible partitions around the word "in":
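    phrase = 'a in b in c in d'
    words = phrase.split()

    # Split the phrase at every standalone "in" and print both halves.
    for n, w in enumerate(words):
        if w == 'in':
            left = ' '.join(words[:n])
            right = ' '.join(words[n + 1:])
            print('(%s) in (%s)' % (left, right))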
For your specific problem, if there are three "in"s in the phrase, the "middle" interpretation, (a in b) in (c in d), is most probably the correct one. With two "in"s, however, text manipulation alone cannot decide, because the "left" and "right" partitions are equally probable. Consider:
Who acted as jeebs in men in black
Who acted as woman in red in matrix
You'll have to use NLP or database-driven methods to parse this correctly.
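As a rough illustration of the database-driven route, the sketch below generates every partition as above and keeps only those whose right-hand side is a known movie title. `KNOWN_MOVIES` and `split_role_and_movie` are hypothetical names, and the hard-coded set merely stands in for a real movie database lookup.

```python
# Hypothetical stand-in for a real movie database (assumption for this sketch).
KNOWN_MOVIES = {"men in black", "matrix", "scarface"}


def split_role_and_movie(phrase, known_movies=KNOWN_MOVIES):
    """Return every (role, movie) split at "in" whose movie part is a known title."""
    words = phrase.split()
    matches = []
    for n, w in enumerate(words):
        if w == "in":
            role = " ".join(words[:n])
            movie = " ".join(words[n + 1:])
            # Keep the split only if the role is non-empty and the movie is known.
            if role and movie.lower() in known_movies:
                matches.append((role, movie))
    return matches


print(split_role_and_movie("jeebs in men in black"))   # [('jeebs', 'men in black')]
print(split_role_and_movie("woman in red in matrix"))  # [('woman in red', 'matrix')]
```

If more than one split survives the lookup, the phrase is still ambiguous and you would need extra information (for example, the cast list of each candidate movie) to choose between them.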