python scikit-learn lda topic-modeling countvectorizer

Prepare dataset for the LDA topic models using CountVectorizer

I want to use CountVectorizer from scikit-learn to create a matrix to be used by an LDA model. However, my dataset is a sequence of coded terms, for example in the following form:

(1-2252, 5-5588, 10-5478, 2-9632 ....)

How can I tell CountVectorizer to treat each pair, e.g. 1-2252, as a single word?

Solution

Fortunately, I found a helpful blog that gave me the answer.

I used the following function to tokenize the text:

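import re
# Split on a comma followed by optional whitespace
REGEX = re.compile(r",\s*")
def tokenize(text):
    # Each comma-separated item (e.g. "1-2252") becomes one token
    return [tok.strip().lower() for tok in REGEX.split(text)]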

Then I passed the tokenizer to CountVectorizer as follows:

from sklearn.feature_extraction.text import CountVectorizer
tf = CountVectorizer(tokenizer=tokenize)
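
For anyone who wants to see the pieces together, here is a rough end-to-end sketch. The three sample documents, n_components=2 and random_state=0 are made up for illustration and are not from my real data; I am also assuming each document is a plain comma-separated string without surrounding parentheses. Recent scikit-learn versions may print a warning that token_pattern is ignored when a custom tokenizer is given, which is harmless here.

```python
import re

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Split on a comma followed by optional whitespace
REGEX = re.compile(r",\s*")


def tokenize(text):
    # Each comma-separated item (e.g. "1-2252") becomes one token
    return [tok.strip().lower() for tok in REGEX.split(text)]


# Hypothetical documents, one string of coded terms per document
docs = [
    "1-2252, 5-5588, 10-5478, 2-9632",
    "1-2252, 2-9632, 7-1111",
    "5-5588, 10-5478, 7-1111, 7-1111",
]

# Build the document-term count matrix using the custom tokenizer
tf = CountVectorizer(tokenizer=tokenize)
X = tf.fit_transform(docs)

# Fit LDA on the counts; the number of topics is an arbitrary choice here
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(sorted(tf.vocabulary_))  # the coded terms kept as whole tokens
print(doc_topics.shape)        # one row per document, one column per topic
```

Each coded pair ends up as its own column in X, so LDA treats 1-2252 as a single vocabulary entry instead of splitting it at the hyphen.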