I would like to remove stopwords.
I have a list of about 15,000 strings, each of which is a short text. My code is the following:
h = []
for w in clean.split():
    if w not in cachedStopWords:
        h.append(w)
    if w in cachedStopWords:
        h.append(" ")
print(h)
I understand that .split() is necessary so that whole strings are not compared against the list of stopwords. But it does not work, because a list cannot be split. (Without any kind of splitting, h ends up equal to clean, because obviously nothing matches.)
Does anyone have an idea how else I could split the different strings in the list while still preserving the different cases?
A very minimal example:
stops = {'remove', 'these', 'words'}
strings = ['please do not remove these words', 'removal is not cool', 'please please these are the bees\' knees', 'there are no stopwords here']
strings_cleaned = [' '.join(word for word in s.split() if word not in stops) for s in strings]
Or you could do:
strings_cleaned = []
for s in strings:
    word_list = []
    for word in s.split():
        if word not in stops:
            word_list.append(word)
    s_string = ' '.join(word_list)
    strings_cleaned.append(s_string)
This is a lot uglier (I think) than the one-liner before it, but perhaps more intuitive.
Make sure you're converting your container of stopwords to a set (a hash-based container whose lookups are O(1) on average) rather than keeping it as a list, whose lookups are O(n).
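As a rough sketch of that conversion, assuming your stopwords start out as a plain list (stopword_list and remove_stopwords are made-up names for illustration, not anything from your code), you could build the set once and reuse it for every string:

```python
# Hypothetical stopword list; yours might come from a library or a file.
stopword_list = ['remove', 'these', 'words']

# Convert once, up front: membership tests on a set are O(1) on average,
# whereas "word in stopword_list" scans the whole list each time.
stops = set(stopword_list)


def remove_stopwords(text, stops):
    """Return text with every whitespace-separated token found in stops dropped."""
    return ' '.join(word for word in text.split() if word not in stops)


strings = ['please do not remove these words', 'removal is not cool']
strings_cleaned = [remove_stopwords(s, stops) for s in strings]
print(strings_cleaned)  # ['please do not', 'removal is not cool']
```

With 15,000 short strings the difference may or may not be noticeable, but the conversion costs one line and matters more as the stopword list grows.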
Edit: This is just a general, very straightforward example of how to remove stopwords. Your use case might be a little different, but since you haven't provided a sample of your data, we can't help any further.