I am implementing a simple doc2vec
with gensim
, not a word2vec
I need to remove stopwords without losing the correct order to a list of list.
Each list is a document and, as I understood for doc2vec, the model will have as input a list of TaggedDocuments
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)
dataset = [['We should remove the stopwords from this example'],
['Otherwise the algo'],
["will not work correctly"],
['dont forget Gensim doc2vec takes list_of_list' ]]
STOPWORDS = ['we','i','will','the','this','from']
def word_filter(lst):
lower=[word.lower() for word in lst]
lst_ftred = [word for word in lower if not word in STOPWORDS]
return lst_ftred
lst_lst_filtered= list(map(word_filter,dataset))
print(lst_lst_filtered)
Output:
[['we should remove the stopwords from this example'], ['otherwise the algo'], ['will not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
Expected Output:
[[' should remove the stopwords example'], ['otherwise the algo'], [' not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
What was my mistake and how to fix?
There are other efficient ways to solve this issue without losing the proper order?
List of questions I examined before asking:
How to apply a function to each sublist of a list in python?
Removing stopwords from list of lists
Removing stopwords from a list of text files
How to remove stop words using nltk or python
lower
is a list of one element, word not in STOPWORDS
will return False
. Take the first item in the list with index and split by blank space
lst_ftred = ' '.join([word for word in lower[0].split() if word not in STOPWORDS])
# output: ['should remove stopwords example', 'otherwise algo', 'not work correctly', 'dont forget gensim doc2vec takes list_of_list']
# 'the' is also in STOPWORDS