I have an XML file that contains more than 2000 phrases. Below is a small sample.
<TEXT>
<PHRASE>
<V>played</V>
<N>John</N>
<PREP>with</PREP>
<en x='PERS'>Adam</en>
<PREP>in</PREP>
<en x='LOC'>ASL school</en>
</PHRASE>
<PHRASE>
<V y='0'>went</V>
<en x='PERS'>Mark</en>
<PREP>to</PREP>
<en x='ORG'>United Nations</en>
<PREP>for</PREP>
<PREP>a</PREP>
<N>visit</N>
</PHRASE>
<PHRASE>
<PREP>in</PREP>
<en x='DATE'>1987</en>
<en x='PERS'>Nick</en>
<V>founded</V>
<en x='ORG'>XYZ company</en>
</PHRASE>
<PHRASE>
<en x='ORG'>Google's</en>
<en x='PERS'>Frank</en>
<V>went</V>
<N>yesterday</N>
<PREP>to</PREP>
<en x='LOC'>San Fransisco</en>
</PHRASE>
</TEXT>
And I have a list of patterns:
finalPatterns=['went \n to \n','created\n the\n', 'founded\n a\n', 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']
What I want is to take each pattern in finalPatterns (for example: went to) and search for it in every phrase in the text. If a phrase contains both went AND to, then its 2 <en> tags should be printed. [Note: if the en tags are not PERS & ORG, nothing is printed.]
When it searches for:
- "went" & "to" --> output: Frank - San Fransisco
- "founded" & "in" --> output: Nick - XYZ company
This is what I tried, but it didn't work. Nothing was printed.
for phrase in root.findall('./PHRASE'):
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if 'ORG' in ens and 'PERS' in ens:
        if all(word in phrase for word in finalPatterns):
            x = "".join(phrase.itertext())  # print what's in between [since I would also like to print the whole sentence]
            print("ORG is: {}, PERS is: {} /".format(ens["ORG"], ens["PERS"]))
This should do the trick:
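for phrase in root.findall('./PHRASE'):
    # Collect the text of every V, N and PREP child of this phrase
    phrasewords = [w.text for w in phrase.findall('V') + phrase.findall('N') + phrase.findall('PREP')]
    for words in finalPatterns:
        # split() drops the spaces and newlines around each word of the pattern
        if all(word in phrasewords for word in words.split()):
            print("found")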
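As far as I can tell, the original check printed nothing because it loops over the whole pattern strings (newlines included) and tests them against the phrase's child elements rather than against their text. Putting the pieces together, a minimal sketch could look like the following. It assumes Python 3 and xml.etree.ElementTree (the question does not say which parser is used), runs on a shortened, well-formed copy of the sample, and keeps the PERS/ORG filter from the question's code. The names SAMPLE and find_matches are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the sample from the question (assumption:
# the real file has the same TEXT > PHRASE > V/N/PREP/en layout).
SAMPLE = """<TEXT>
<PHRASE>
<V y='0'>went</V>
<en x='PERS'>Mark</en>
<PREP>to</PREP>
<en x='ORG'>United Nations</en>
<PREP>for</PREP>
<PREP>a</PREP>
<N>visit</N>
</PHRASE>
<PHRASE>
<PREP>in</PREP>
<en x='DATE'>1987</en>
<en x='PERS'>Nick</en>
<V>founded</V>
<en x='ORG'>XYZ company</en>
</PHRASE>
</TEXT>"""

finalPatterns = ['went \n to \n', 'created\n the\n', 'founded\n a\n',
                 'went\n yesterday\n to\n', 'a\n visit\n', 'founded\n in\n']


def find_matches(root, patterns):
    """Return (pattern, PERS, ORG, sentence) for every phrase/pattern match."""
    matches = []
    for phrase in root.findall('./PHRASE'):
        # Map each entity type to its text, e.g. {'PERS': 'Nick', 'ORG': ...}
        ens = {en.get('x'): en.text for en in phrase.findall('en')}
        if 'ORG' not in ens or 'PERS' not in ens:
            continue
        # Text of every child that is not an <en> tag (V, N, PREP here)
        words = {child.text for child in phrase if child.tag != 'en'}
        # Whole sentence, with the whitespace between tags stripped out
        sentence = " ".join(t.strip() for t in phrase.itertext() if t.strip())
        for pattern in patterns:
            parts = pattern.split()  # 'went \n to \n' -> ['went', 'to']
            if all(part in words for part in parts):
                matches.append((" ".join(parts), ens['PERS'], ens['ORG'], sentence))
    return matches


root = ET.fromstring(SAMPLE)
for pattern, pers, org, sentence in find_matches(root, finalPatterns):
    print("{}: PERS is: {}, ORG is: {} / {}".format(pattern, pers, org, sentence))
```

For the real file you would get root from ET.parse(...).getroot() instead of ET.fromstring. Note that this only checks that all words of a pattern occur somewhere in the phrase, not that they are adjacent or in order.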