I wrote the code below. I want to print the words in the first 10 sentences and remove every word that is not a noun, verb, adjective, adverb, or proper name, but I don't know how. Can anyone help me?
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
import spacy
nlp = spacy.load('en')
tokens = [[token.text for token in nlp(sentence)] for sentence in documents[:200]]
pos = [[token.pos_ for token in nlp(sentence)] for sentence in documents[:100]]
pos
All you need to know is which POS tags spaCy uses to represent these categories; the spaCy documentation has the full list. This code filters tokens by those tags:
import spacy
nlp = spacy.load('en_core_web_sm') #you can use other methods
# excluded tags
excluded_tags = {"NOUN", "VERB", "ADJ", "ADV", "ADP", "PROPN"}
document = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
sentences = document[:10] #first 10 sentences
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ not in excluded_tags:
            new_sentence.append(token.text)
    new_sentences.append(" ".join(new_sentence))
Now, new_sentences holds the same sentences as before, but without any nouns, verbs, etc. You can check this by iterating over sentences and new_sentences to see the difference:
for old_sen, new_sen in zip(sentences, new_sentences):
    print("Before:", old_sen)
    print("After:", new_sen)
    print()
Before: Loomings .
After: .
Before: Call me Ishmael .
After: me .
Before: Some years ago -- never mind how long precisely -- having little or no money in my purse , and nothing particular to interest me on shore , I thought I would sail about a little and see the watery part of the world .
After: Some -- -- or no my , and nothing to me , I I a and the the .
Before: It is a way I have of driving off the spleen and regulating the circulation .
After: It is a I have the and the .
...
...
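The code above drops the tokens whose tag is in excluded_tags. If the goal is the reverse, as the question is worded (keep only nouns, verbs, adjectives, adverbs, and proper names), the membership test can be flipped. Here is a minimal sketch of that variant. The names keep_content_words and KEEP_TAGS are made up for illustration, ADP is left out because the question does not ask for it, and the example tokens are hand-tagged stand-ins rather than real spaCy output:

```python
from collections import namedtuple

# Tags for the categories named in the question: nouns, verbs,
# adjectives, adverbs, and proper nouns (ADP is left out on purpose).
KEEP_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}


def keep_content_words(tokens, keep_tags=KEEP_TAGS):
    """Return the text of every token whose coarse POS tag is in keep_tags."""
    return [token.text for token in tokens if token.pos_ in keep_tags]


# Hypothetical stand-in for spaCy tokens so the sketch runs without a model;
# the tags below are hand-assigned for illustration, not real spaCy output.
Token = namedtuple("Token", ["text", "pos_"])
example = [
    Token("Call", "VERB"),
    Token("me", "PRON"),
    Token("Ishmael", "PROPN"),
    Token(".", "PUNCT"),
]

print(keep_content_words(example))  # ['Call', 'Ishmael']
```

Since spaCy tokens expose .text and .pos_, calling keep_content_words(nlp(sentence)) for each entry in sentences should work the same way, though the exact tags depend on the model you load.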