I want to use CountVectorizer from scikit-learn to create a matrix to be used by an LDA model. But my dataset is a sequence of coded terms, for example in the following form:
(1-2252, 5-5588, 10-5478, 2-9632 ....)
How can I tell CountVectorizer to treat each pair, e.g. 1-2252, as one word?
Fortunately, I found a helpful blog that gave me the answer. I used the following function to tokenize the text:
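import re

# Split on commas, allowing optional whitespace after each comma
REGEX = re.compile(r",\s*")

def tokenize(text):
    # Each comma-separated piece such as "1-2252" becomes one token
    return [tok.strip().lower() for tok in REGEX.split(text)]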
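To check what a comma-splitting tokenizer returns for the sample above, here is a minimal standalone sketch. It assumes the raw string may be wrapped in parentheses, as in the example, and strips them. That step and the name SEPARATOR are my own additions for illustration, not part of the blog's method.

```python
import re

# Split on commas, allowing optional whitespace after each comma
SEPARATOR = re.compile(r",\s*")

def tokenize(text):
    # Assumption: the raw text may be wrapped in parentheses; drop them first
    cleaned = text.strip().strip("()")
    # Keep each non-empty comma-separated piece as one token
    return [tok.strip().lower() for tok in SEPARATOR.split(cleaned) if tok.strip()]

sample = "(1-2252, 5-5588, 10-5478, 2-9632)"
tokens = tokenize(sample)
print(tokens)
```

Each pair stays intact because the split happens only on commas, never on the hyphen.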
Then I passed the tokenizer to CountVectorizer as follows:
from sklearn.feature_extraction.text import CountVectorizer
tf = CountVectorizer(tokenizer=tokenize)
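For an end-to-end check, the sketch below fits a vectorizer on two made-up documents (the docs list is invented for illustration, not real data). Passing token_pattern=None is my assumption for avoiding the warning that recent scikit-learn versions emit when a custom tokenizer is supplied, and lowercase=False is optional since the codes contain no letters.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

SEPARATOR = re.compile(r",\s*")

def tokenize(text):
    # Keep each non-empty comma-separated piece (e.g. "1-2252") as one token
    return [tok.strip() for tok in SEPARATOR.split(text) if tok.strip()]

# Hypothetical documents, each a comma-separated sequence of coded terms
docs = [
    "1-2252, 5-5588, 10-5478",
    "1-2252, 2-9632, 1-2252",
]

vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None, lowercase=False)
matrix = vectorizer.fit_transform(docs)  # sparse document-term count matrix

vocab = vectorizer.vocabulary_  # maps each token to its column index
print(sorted(vocab))
print(matrix.toarray())
```

The resulting sparse matrix has one row per document and one column per distinct pair, which is the input shape that scikit-learn's LatentDirichletAllocation expects.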