python nlp bioinformatics gensim word2vec

Word2Vec error: TypeError: unhashable type: 'list'

I'm experimenting with peptide sequences and NLP, and I'm trying to embed the peptide sequences using word2vec. The peptides come as long strings (e.g. 'KCNTATCATQRLANFLVRSSNNLGPVLPPTNVGSNTY'), so I've split each sequence into trigrams. But when I try to embed them, I keep getting this error: TypeError: unhashable type: 'list'.

Not sure how to fix this error as I don't quite understand why it's coming up. My code is linked here, and here is the full error output:

TypeError                                 Traceback (most recent call last)
<ipython-input-17-966c68819734> in <module>()
      6 
      7 # embeddings pos
----> 8 w2vpos = Word2Vec(kmersdatapos, size=EMB_DIM,window=5,min_count=5,negative=15,iter=10,workers=multiprocessing.cpu_count())

4 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/word2vec.py in _scan_vocab(self, sentences, progress_per, trim_rule)
   1553                 )
   1554             for word in sentence:
-> 1555                 vocab[word] += 1
   1556             total_words += len(sentence)
   1557 

TypeError: unhashable type: 'list'

Any suggestions are appreciated!

Solution

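You need to pass a list of lists of strings to gensim's Word2Vec: each inner list is one sentence, and each string is one token. In your code you are passing kmersdatapos, which is a list of lists of lists of strings. As a result, each "word" that gensim tries to count at vocab[word] += 1 is itself a list, and a list cannot be used as a dictionary key because it is unhashable. For example: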

corpus = [["lorem", "ipsum"], ["dolor"], ["sit", "amet"]]

is a valid argument to Word2Vec, whereas

corpus = [[["lorem", "ipsum"], ["dolor"]], [["sit", "amet"]]]

is invalid, because it has one extra level of nesting.
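To make the difference concrete, here is a minimal sketch that reproduces the error without gensim and then removes the extra nesting level. The helper names (to_kmers, count_tokens), the sample peptides, and the choice of overlapping trigrams are all assumptions made for illustration, since the original code is not shown; count_tokens only mimics the vocab[word] += 1 loop from the traceback and is not gensim's actual implementation.

```python
from collections import defaultdict


def to_kmers(sequence, k=3):
    """Split one peptide string into overlapping k-mers (a flat list of strings).

    Overlapping windows are an assumption; the question does not say how
    the trigrams were produced.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_tokens(sentences):
    """Hypothetical stand-in for the counting loop shown in the traceback."""
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1  # fails if word is a list
    return vocab


# Hypothetical sample data, shortened from the peptide in the question.
peptides = ["KCNTATCATQ", "RLANFLVRSS"]

# Correct shape: one "sentence" (a list of strings) per peptide.
good_corpus = [to_kmers(p) for p in peptides]

# Too deep: each sentence holds a list of k-mers instead of the k-mers themselves.
bad_corpus = [[to_kmers(p)] for p in peptides]

try:
    count_tokens(bad_corpus)
except TypeError as exc:
    error_message = str(exc)  # the same kind of error as in the question

# Flatten one level so that every sentence is again a list of strings.
fixed_corpus = [
    [token for group in sentence for token in group]
    for sentence in bad_corpus
]
```

If kmersdatapos has the same shape as bad_corpus here, flattening it the same way should give a structure you can pass to Word2Vec in its place. Checking type(kmersdatapos[0][0]) is a quick way to confirm: it should be str, not list.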