For this project, I am using the wikipedia, spacy, and textacy modules.
I use the wikipedia module to grab the page for whatever I set my subject to; it returns the page's contents as a string.
Then I use textacy.extract.semistructured_statements() to filter out facts. It takes two required arguments: the first is the document, and the second is the entity.
For testing purposes, I have tried setting the subject to Ubuntu and Bill Gates.
import wikipedia
import spacy
import textacy

#The subject we are looking for
subject = 'Bill Gates'
#The Wikipedia page
wikiResults = wikipedia.search(subject)
wikiPage = wikipedia.page(wikiResults[0]).content
#spaCy
nlp = spacy.load("en_core_web_sm")
document = nlp(wikiPage)
#textacy.extract
statements = textacy.extract.semistructured_statements(document, subject)
for statement in statements:
    entity, verb, fact = statement
    print(fact)
So when I run the program, I get multiple results when searching for Ubuntu, but none for Bill Gates. Why is this, and how can I improve my code to extract more facts from a Wikipedia page?
You need to process the document with different cues to extract the common verbs used to describe a subject, and you need to split the search string when the subject is more than one word. For 'Bill Gates', for example, you will need to search for the 'Bill', 'Gates', and 'Bill Gates' combinations, and you need to try different cue verbs used to describe a person or object of interest.
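For instance, one way to build those combinations automatically is with a small helper like the one below (my own sketch, not a textacy function):

def subject_variants(subject):
    # Split a multi-word subject into its parts plus the full phrase,
    # e.g. 'Bill Gates' -> {'Bill', 'Gates', 'Bill Gates'}
    variants = set(subject.split())
    variants.add(subject)
    return variants

print(subject_variants('Bill Gates'))  # {'Bill', 'Gates', 'Bill Gates'}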
So for example searching for 'Gates':
statements = textacy.extract.semistructured_statements(document, "Gates", cue='have', max_n_words=200)
will get you more results like:
* entity: Gates , cue: had , fact: primary responsibility for Microsoft's product strategy from the company's founding in 1975 until 2006
* entity: Gates , cue: is , fact: notorious for not being reachable by phone and for not returning phone calls
* entity: Gates , cue: was , fact: the second wealthiest person behind Carlos Slim, but regained the top position in 2013, according to the Bloomberg Billionaires List
* entity: Bill , cue: were , fact: the second-most generous philanthropists in America, having given over $28 billion to charity
* entity: Gates , cue: was , fact: seven years old
* entity: Gates , cue: was , fact: the guest on BBC Radio 4's Desert Island Discs on January 31, 2016, in which he talks about his relationships with his father and Steve Jobs, meeting Melinda Ann French, the start of Microsoft and some of his habits (for example reading The Economist "from cover to cover every week
* entity: Gates , cue: was , fact: the world's highest-earning billionaire in 2013, as his net worth increased by US$15.8 billion to US$78.5 billion
Please note that the verbs can be negated, as in the second result!
I also noticed that using a max_n_words value larger than the default of 20 can lead to more interesting statements.
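If you want to see that effect for yourself, a quick illustrative loop like this (assuming the same document object and the same textacy version used above) counts how many statements each cap yields:

for cap in (20, 50, 100, 200):
    # list() exhausts the generator so we can count the results
    found = list(textacy.extract.semistructured_statements(document, "Gates", cue="be", max_n_words=cap))
    print("max_n_words =", cap, "->", len(found), "statements")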
Here is my complete script:
import wikipedia
import spacy
import textacy
import en_core_web_sm

subject = 'Bill Gates'
#The Wikipedia page
wikiResults = wikipedia.search(subject)
#print("wikiResults:", wikiResults)
wikiPage = wikipedia.page(wikiResults[0]).content
print("\n\nwikiPage:", wikiPage, "\n")
nlp = en_core_web_sm.load()
document = nlp(wikiPage)

#Collect statements for every subject/cue combination, deduplicated in a set
uniqueStatements = set()
for word in ["Gates", "Bill", "Bill Gates"]:
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statements = textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200)
        for statement in statements:
            uniqueStatements.add(statement)

print("found", len(uniqueStatements), "statements.")
for statement in uniqueStatements:
    entity, cue, fact = statement
    print("* entity:", entity, ", cue:", cue, ", fact:", fact)
Varying the subjects and cue verbs gets me 23 results instead of one.
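If you still see near-duplicates (the same fact found via different entity words), you could deduplicate on the fact's text instead of on the statement tuples themselves; this is my own variation, not something the script above does:

seen = set()
for entity, cue, fact in uniqueStatements:
    key = str(fact).strip().lower()  # normalize the fact text for comparison
    if key not in seen:
        seen.add(key)
        print("* entity:", entity, ", cue:", cue, ", fact:", fact)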