
Python: tokenizing 2-word phrases for a word2vec model


I'm using the Python gensim package for word2vec.

I want to run the model on tokenized words and 2-word phrases. I have ~10,000 documents, and I used the nltk RegexpTokenizer to get the single-word tokens from all of them. How can I tokenize the documents so that I also get the 2-word phrases?

For example:

document: "I have a green apple"

and the 2-word phrases: {I_have}, {green_apple}, ... etc.


Solution

  • One option is to use ngrams from nltk with n=2, which gives a list of tuples:

    from nltk import ngrams

    document = "I have a green apple"
    # slide a window of size 2 over the word tokens
    bigram_list = list(ngrams(document.split(), 2))
    # -> [('I', 'have'), ('have', 'a'), ('a', 'green'), ('green', 'apple')]
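    To get the I_have style tokens from the question, the tuples can be joined with an underscore; a minimal sketch, assuming you want to feed word2vec one token list per document that mixes unigrams and bigrams:

    # join each tuple into a single "w1_w2" token, matching the question's format
    bigram_tokens = ["_".join(pair) for pair in bigram_list]
    # ['I_have', 'have_a', 'a_green', 'green_apple']

    # one training sentence combining single words and 2-word phrases
    tokens = document.split() + bigram_tokens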
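
  • Another option is to let gensim itself decide which word pairs are real phrases. Its Phrases model learns frequently co-occurring pairs from the corpus and rewrites them as single underscore-joined tokens, so you get green_apple but not every adjacent pair. A minimal sketch; the corpus and the min_count/threshold values here are illustrative, tuned low only so the toy data produces bigrams:

    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser

    # hypothetical corpus: one token list per document
    sentences = [doc.split() for doc in ["I have a green apple",
                                         "the green apple is fresh"]]

    # learn which adjacent pairs co-occur often enough to count as phrases
    phrases = Phrases(sentences, min_count=1, threshold=0.1)
    bigram = Phraser(phrases)

    # detected pairs are rewritten as single "w1_w2" tokens
    phrased = [bigram[s] for s in sentences]

    # gensim 4 uses vector_size; older versions call this parameter size
    model = Word2Vec(phrased, vector_size=100, min_count=1)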