I would like to extract a specific group of words from a list of comments scraped from a website, count them, and use the most common ones in my TextBlob dictionary for a simple sentiment analysis. To simplify: I would like to get all the adjectives that might carry positive or negative sentiment. komentarze
is a huge list of strings; every string is a sentence whose sentiment I would like to check. I want to build a list of words from this list of strings and then check which adjectives, the ones that are not punctuation or stop words and stand directly before a verb, are the most frequent. When I run my code, I get an error: IndexError: [E040] Attempt to access token at 18, max length 18. This error stands for "Attempt to access token at {i}, max length {max_length}". I tried different approaches, but none of them works.
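As far as I understand, the error means I am indexing past the last token of the doc; a minimal illustration of what I mean (the sentence is only an example):

import spacy

nlp = spacy.load('pl_core_news_lg')
doc = nlp("To jest przykład.")   # say this gives 4 tokens, indices 0-3
print(doc[len(doc) - 1])         # fine: the last token
print(doc[len(doc)])             # IndexError: [E040] Attempt to access token at 4, max length 4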
Here is an example of the code that gives the E040 error:
import spacy
import json
import pandas as pd
from spacy.lang.pl.stop_words import STOP_WORDS
from spacy.tokens import Token
from spacy.lang.pl.examples import sentences
from collections import Counter

with open('directory/file.json', mode='r') as f:
    dane = json.load(f)

df = pd.DataFrame(dane)
komentarze = df['komentarz'].tolist()

nlp = spacy.load('pl_core_news_lg')

slowa = []
zwroty = []
for doc in nlp.pipe(komentarze):
    # here I want to extract the most common words
    slowa += [token.text for token in doc if not token.is_stop and not token.is_punct]
    # here I want to extract ADJs that are not punctuation nor stop words and stand before a verb
    zwroty += [token.text for token in doc if (not token.is_stop and not token.is_punct and
               token.pos_ == "ADJ" and doc[token.i + 1].pos_ == "VERB")]

zwroty_freq = Counter(zwroty)
common_zwroty = zwroty_freq.most_common(100)
print(common_zwroty)
When I run an additional comprehension in the loop, adjsy += [token.text for token in doc if (not token.is_stop and not token.is_punct and token.pos_ == "ADJ")], everything works, but I simply cannot specify the word before or after that ADJ.
I can iterate over a simple string via:

for token in doc:
    if token.pos_ == 'ADJ':
        if doc[token.i + 1].pos_ == 'VERB':
            print('yaaay')
but I genuinely have no idea how to set it up in my loop. I also tried:
for token in doc:
    if not token.is_stop and not token.is_punct:
        if token.pos_ == "ADJ":
            if doc[token.i+1].pos_ == "NOUN" in range(1):
                zwroty += token.text
but this gave me only single letters (zwroty += token.text extends the list with the string's individual characters; zwroty.append(token.text) would add the whole word).
How can I fix my code to get what I want?
Is it also possible to lowercase the text in this loop? I tried several times, but nothing worked…
EDITED:
I modified my code as @polm23 proposed. Well, it works, but I cannot combine this method with my [w.lemma_ for w in doc if not w.is_stop and not w.is_punct and not w.like_num and w.pos_ == "VERB"] list comprehension; that gives me an error: ValueError: [E195] Matcher can be called on Doc or Span only, got Token.
Here is a piece of code that, thanks to @polm23, works, but takes numbers, punctuation, etc. into account:
# import everything I need

with open('file.json', mode='r') as f:
    dane = json.load(f)

df = pd.DataFrame(dane)
komentarze = df['komentarz'].tolist()

nlp = spacy.load('pl_core_news_lg')

matcher = Matcher(nlp.vocab)
patterns = [[{'POS': 'ADJ'}, {'POS': 'NOUN'}]]
matcher.add("demo", patterns)

zwroty = []
for doc in nlp.pipe(komentarze):
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        zwroty += (match_id, string_id, start, end, span.text)
Here is a piece of code that does not work, although it should take this into account:
for w in doc:
    if not w.is_stop and not w.is_punct:
        w.lemma_
        matches = matcher(w)
        for match_id, start, end in matches:
            string_id = nlp.vocab.strings[match_id]
            span = w[start:end]
            zwroty += (match_id, string_id, start, end, span.text)
This is a perfect use case for spaCy's Matchers. Here's an example of matching ADJ NOUN in English:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
patterns = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
]
matcher.add("demo", patterns)

doc = nlp("There is a red card in the blue envelope.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]                    # The matched span
    print(match_id, string_id, start, end, span.text)
Output:
2193290520773312886 demo 3 5 red card
2193290520773312886 demo 7 9 blue envelope
You can use these matches in a Counter or something to track the frequency if you want. You can also set up a function to run whenever something is matched.
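For example, a minimal sketch of both (assuming komentarze is your list of comments, and collect is just an example name):

from collections import Counter

freq = Counter()
for doc in nlp.pipe(komentarze):
    for match_id, start, end in matcher(doc):
        freq[doc[start:end].text] += 1
print(freq.most_common(10))

# a callback that runs on every match
def collect(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    print("matched:", doc[start:end].text)

matcher.add("demo2", patterns, on_match=collect)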
Is it also possible to lowercase the text in this loop? I tried several times, but nothing worked…
Not entirely sure what you want to do, but if you have a match function you can just use the lower_ attribute on a token. Also take a look at lemma_, which might be better, especially for verbs.
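A short sketch of what I mean (the attributes are spaCy's; doc is any processed comment):

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print([t.lower_ for t in span])   # lowercased token texts
    print([t.lemma_ for t in span])   # lemmas, often more useful for verbs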
Not entirely sure I understand what you're trying to do, but it looks like the issue is that you're trying to filter tokens and then pass them to the Matcher. Instead, use the Matcher on the docs and then filter its output.
Also, punctuation can never be an adjective; I'm not sure why you're checking for that.
out = []
for doc in docs:
    matches = matcher(doc)
    # because we are just matching [ADJ NOUN] we know the first token is ADJ
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        adj = doc[start]
        # ignore stop words
        if adj.is_stop:
            continue
        # get the lemma
        lemma = adj.lemma_
        out.append((adj, lemma))  # or whatever
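And if you also want to skip numbers and lowercase everything, as in your list comprehension, a sketch along the same lines (assuming komentarze, nlp, and matcher from your edited code):

from collections import Counter

freq = Counter()
for doc in nlp.pipe(komentarze):
    for match_id, start, end in matcher(doc):
        adj, noun = doc[start], doc[start + 1]
        # skip stop words and number-like tokens
        if adj.is_stop or adj.like_num or noun.like_num:
            continue
        freq[adj.lemma_.lower() + " " + noun.lemma_.lower()] += 1

print(freq.most_common(100))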