This is a follow-up to my question. I am using nltk to parse out persons, organizations, and their relationships. Using this example, I was able to create chunks of persons and organizations; however, I am getting an error in the nltk.sem.extract_rels call:
AttributeError: 'Tree' object has no attribute 'text'
Here is the complete code:
import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)
# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]
# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)
This example is very similar to the one given in the book, but the book's example uses prepared 'parsed docs', which appear out of nowhere, and I don't know where to find their object type. I scoured through the git repositories as well. Any help is appreciated.
My ultimate goal is to extract persons, organizations, titles (dates) for some companies; then create network maps of persons and organizations.
It looks like, to be a "parsed doc", an object needs to have a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works:
import nltk
import re
IN = re.compile (r'.*\bin\b(?!\b.+ing)')
class doc():
    pass
doc.headline=['foo']
doc.text=[nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION',['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']
for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print nltk.sem.relextract.show_raw_rtuple(rel)
When run, this produces the output:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
Obviously you wouldn't actually code it like this, but it provides a working example of the data format expected by extract_rels; you just need to work out the preprocessing steps that massage your data into that format.
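One possible way to do that massaging is to wrap each chunked sentence in a small object that carries the two members. This is only a sketch: ParsedDoc and to_parsed_docs are hypothetical names, not part of nltk, and it assumes that iterating over a chunked sentence yields its tokens and named-entity subtrees, and that an empty headline is acceptable. Note that tokens coming from pos_tag are (word, tag) tuples rather than the plain strings used in the hacky example, so the output may look different.

```python
class ParsedDoc(object):
    """Hypothetical stand-in for a 'parsed doc': just headline and text members."""

    def __init__(self, text, headline=None):
        # text: flat list of tokens, with named entities left as subtrees
        self.text = list(text)
        # headline: same kind of list; assumed to be allowed to be empty
        self.headline = list(headline) if headline is not None else []


def to_parsed_docs(chunked_sentences):
    """Wrap each chunked sentence (an iterable of tokens and subtrees) in a ParsedDoc."""
    return [ParsedDoc(sentence) for sentence in chunked_sentences]
```

With the variables from the question, the loop would then iterate over to_parsed_docs(chunked_sentences) instead of chunked_sentences and pass each resulting object to extract_rels as before.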