Tags: python, list, gensim, stop-words

Preprocessing a list of lists: removing stopwords for doc2vec using map without losing word order


I am implementing a simple doc2vec model with gensim (not word2vec).

I need to remove stopwords from a list of lists without losing the original word order.

Each inner list is a document and, as I understand it, the Doc2Vec model takes a list of TaggedDocument objects as input:

model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)

dataset = [['We should remove the stopwords from this example'],
     ['Otherwise the algo'],
     ["will not work correctly"],
     ['dont forget Gensim doc2vec takes list_of_list' ]]

STOPWORDS = ['we','i','will','the','this','from']


def word_filter(lst):
  lower=[word.lower() for word in lst]
  lst_ftred = [word for word in lower if not word in STOPWORDS]
  return lst_ftred

lst_lst_filtered= list(map(word_filter,dataset))
print(lst_lst_filtered)

Output:

[['we should remove the stopwords from this example'], ['otherwise the algo'], ['will not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]

Expected Output:

[[' should remove the stopwords   example'], ['otherwise the algo'], [' not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]

  • What was my mistake, and how can I fix it?

  • Are there other efficient ways to solve this without losing the proper word order?


List of questions I examined before asking:

How to apply a function to each sublist of a list in python?

  • I studied this and tried to apply it to my specific case.

Removing stopwords from list of lists

  • The order is important, so I can't use a set.

Removing stopwords from a list of text files

  • This could be a possible solution; it is similar to what I have implemented.
  • I understood the difference, but I don't know how to deal with it. In my case the document is not tokenized (and should not be tokenized, because this is doc2vec, not word2vec).

How to remove stop words using nltk or python

  • In this question the asker is dealing with a list, not a list of lists.

Solution

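  • lower is a list with a single element (the whole sentence), so word in STOPWORDS compares the entire sentence against the stop words and is always False; not word in STOPWORDS is therefore always True and nothing is filtered out. Take the first item of the list by index and split it on whitespace: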

    lst_ftred = ' '.join([word for word in lower[0].split() if word not in STOPWORDS])
    # output of list(map(word_filter, dataset)):
    # ['should remove stopwords example', 'otherwise algo', 'not work correctly', 'dont forget gensim doc2vec takes list_of_list']
    # 'the' is removed as well, because it is also in STOPWORDS
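
For completeness, here is a minimal, self-contained sketch of the same fix, written so that each document stays a one-element list as in the question. It assumes every inner list holds exactly one string. Storing the stop words in a frozenset is my own choice, not part of the original code: it only speeds up the membership test and does not change word order, because the loop iterates over the words of the sentence, not over the set. tokenize_filter is a hypothetical helper added for illustration, in case a list of tokens per document is wanted instead of a joined string.

```python
# Stop words from the question. A frozenset gives fast membership tests;
# word order is unaffected because we iterate over the sentence, not the set.
STOPWORDS = frozenset(['we', 'i', 'will', 'the', 'this', 'from'])

dataset = [['We should remove the stopwords from this example'],
           ['Otherwise the algo'],
           ['will not work correctly'],
           ['dont forget Gensim doc2vec takes list_of_list']]


def word_filter(doc):
    """Return a one-element list with the lowercased text of doc[0],
    stop words removed, original word order kept.
    Assumes doc contains exactly one string."""
    words = doc[0].lower().split()
    kept = [word for word in words if word not in STOPWORDS]
    return [' '.join(kept)]


def tokenize_filter(doc):
    """Hypothetical variant: return the kept words as a list of tokens."""
    return [word for word in doc[0].lower().split() if word not in STOPWORDS]


lst_lst_filtered = list(map(word_filter, dataset))
print(lst_lst_filtered)
# [['should remove stopwords example'], ['otherwise algo'],
#  ['not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
```

If the documents are later wrapped in gensim's TaggedDocument, the token-list variant may be the more convenient shape, since its words argument is a list of tokens rather than a single string.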