I have some volunteer essay writings in the format of:
volunteer_names, essay
["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]
["jenne", "li"], [["lets", "manage", "waste"]]
["emi", "li", "jim"], [["python", "is", "cool"]]
...
...
...
I want to identify the similar users based on their essay writings. I feel like word2vec is more suitable in problems like this. However, since I want to embed user names too in the model I am not sure how to do it. The examples I found in the internet only uses the words (See example code).
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
In that case, I am wondering if there is special way of doing this in word2vec or can I simply consider user names as just words to input to the model. please let me know your thoughts on this.
I am happy to provide more details if needed.
Word2vec infers the word representation from surrounding words: words similarly often appear in a similar company end up with similar vectors. Usually, a window of 5 words is considered. So, if you want to hack Word2vec, you would need to make sure that the student names will appear frequently enough (perhaps at a beginning and at the end of a sentence or something like that).
Alternatively, you can have a look at Doc2vec. During training, each document gets an ID and learns an embedding for the ID, they are in a lookup table as if they were word embeddings. If you use student names as document IDs, you would get student embeddings. If you have multiple essays from one student, I suppose you would need to hack Gensim a little bit not to have a unique ID for each essay.