I'm using the Python gensim package for word2vec.
I want to run the model on tokenized words and 2-word phrases. I have ~10,000 documents, and I used the nltk RegexpTokenizer to get the single-word tokens from all the documents. How can I tokenize the documents to also get the 2-word phrases?
For example:
document: "I have a green apple"
and the desired 2-word phrases: {I_have}, {green_apple}, etc.
One option is to use nltk's ngrams with n=2, which yields a list of adjacent-token tuples:
from nltk import ngrams

document = "I have a green apple"
n = 2
bigram_list = list(ngrams(document.split(), n))
# [('I', 'have'), ('have', 'a'), ('a', 'green'), ('green', 'apple')]
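word2vec expects string tokens rather than tuples, so the tuples can be joined with an underscore to match the {I_have} style. A minimal sketch (pure Python, so it also works without nltk; `bigram_tokens` is a hypothetical helper name):

```python
def bigram_tokens(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

document = "I have a green apple"
tokens = document.split()

# Single words plus all adjacent 2-word phrases, ready for word2vec.
combined = tokens + bigram_tokens(tokens)
print(combined)
# ['I', 'have', 'a', 'green', 'apple', 'I_have', 'have_a', 'a_green', 'green_apple']
```

Note this produces every adjacent pair; filtering down to meaningful phrases (e.g. keeping {green_apple} but not {have_a}) needs a collocation measure such as gensim's Phrases.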