I'm experimenting with peptide sequences and NLP right now and am trying to embed the peptide sequences using word2vec.
The peptides come as long strings (e.g. 'KCNTATCATQRLANFLVRSSNNLGPVLPPTNVGSNTY'), so I've split each peptide sequence into trigrams. But when I try to embed them, I keep getting this error: TypeError: unhashable type: 'list'
I'm not sure how to fix it because I don't understand why it comes up. My code is linked here, and here is the full error output:
TypeError Traceback (most recent call last)
<ipython-input-17-966c68819734> in <module>()
6
7 # embeddings pos
----> 8 w2vpos = Word2Vec(kmersdatapos, size=EMB_DIM,window=5,min_count=5,negative=15,iter=10,workers=multiprocessing.cpu_count())
4 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/word2vec.py in _scan_vocab(self, sentences, progress_per, trim_rule)
1553 )
1554 for word in sentence:
-> 1555 vocab[word] += 1
1556 total_words += len(sentence)
1557
TypeError: unhashable type: 'list'
Any suggestions are appreciated!
You need to pass a list of lists of strings to gensim's Word2Vec: one inner list per sentence, with each token a plain string. In your code you are passing kmersdatapos, which is a list of lists of lists of strings, so it is nested one level too deep. For example:
corpus = [["lorem", "ipsum"], ["dolor"], ["sit", "amet"]]
is a valid argument for Word2Vec, whereas
corpus = [[["lorem", "ipsum"], ["dolor"]], [["sit", "amet"]]]
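is invalid. With the extra nesting level, each "word" that gensim reads from a sentence is itself a list. Gensim counts words by using them as dictionary keys (the vocab[word] += 1 line in your traceback), and a list is not hashable, hence the TypeError.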
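Since I can't see how kmersdatapos was built, here is only a rough sketch of two ways to get the right shape. The helper names to_kmers and flatten_one_level are made up for illustration, and I'm assuming overlapping trigrams and that each outermost item corresponds to one peptide; adjust if your data is organised differently.

```python
from typing import List


def to_kmers(sequence: str, k: int = 3) -> List[str]:
    """Split one peptide string into overlapping k-mers (trigrams for k=3).

    Hypothetical helper: overlapping windows are an assumption, the
    original post does not say how the trigrams were cut.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def flatten_one_level(nested: List[List[List[str]]]) -> List[List[str]]:
    """Remove one nesting level so every sentence is a flat list of strings.

    Assumes each outermost item is one sentence (one peptide).
    """
    return [[token for group in sentence for token in group]
            for sentence in nested]


# Option 1: build the corpus in the right shape from the raw strings.
peptides = ["KCNTATCATQRLANFLVRSSNNLGPVLPPTNVGSNTY"]
corpus = [to_kmers(p) for p in peptides]  # list of lists of str

# Option 2: repair data that is already nested one level too deep.
nested = [[["lorem", "ipsum"], ["dolor"]], [["sit", "amet"]]]
flat = flatten_one_level(nested)

# Sanity check before training: every token must be a plain string.
assert all(isinstance(tok, str) for sent in corpus for tok in sent)
```

You would then pass corpus (or the flattened list) to Word2Vec in place of kmersdatapos, keeping your other arguments. If each inner list in your data is really its own sentence rather than part of one peptide, concatenate the outer lists instead of merging the inner ones. As a side note, gensim 4.x renamed size to vector_size and iter to epochs, so the call from your traceback needs those names if you upgrade.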