python, scikit-learn, lda, topic-modeling, countvectorizer

Prepare dataset for the LDA topic models using CountVectorizer


I want to use CountVectorizer from scikit-learn to create a matrix to be used by an LDA model. However, my dataset is a sequence of coded terms, for example in the following form:

(1-2252, 5-5588, 10-5478, 2-9632 ....)

How can I tell CountVectorizer to treat each pair, e.g. 1-2252, as a single word?


Solution

  • Fortunately, I found a helpful blog that gave me the answer.

    I used the following function to tokenize the text, splitting it on commas so that each coded term stays intact:

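    import re

    # Split on a comma plus any whitespace after it.
    REGEX = re.compile(r",\s*")
    def tokenize(text):
        # Each comma-separated item, e.g. "1-2252", becomes one token.
        return [tok.strip().lower() for tok in REGEX.split(text)]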
    

    Then pass the tokenizer to CountVectorizer as follows:

    from sklearn.feature_extraction.text import CountVectorizer
    tf = CountVectorizer(tokenizer=tokenize)
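
For anyone who wants to see the whole thing end to end, here is a minimal sketch of how the tokenizer, CountVectorizer, and an LDA model might fit together. The sample documents, the choice of two topics, and the stripping of surrounding parentheses are my own assumptions for illustration; they are not part of the original question or the blog's solution.

```python
import re

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Split on a comma plus any whitespace after it, as in the answer above.
REGEX = re.compile(r",\s*")


def tokenize(text):
    # Assumption: surrounding parentheses are notation, not data, so drop them.
    text = text.strip().strip("()")
    # Skip empty items so a trailing comma does not produce an empty token.
    return [tok.strip().lower() for tok in REGEX.split(text) if tok.strip()]


# Hypothetical documents: each is a comma-separated sequence of coded terms.
docs = [
    "1-2252, 5-5588, 10-5478, 2-9632",
    "1-2252, 1-2252, 7-1111",
    "(5-5588, 2-9632, 7-1111)",
]

# token_pattern=None makes it explicit that the custom tokenizer is used
# instead of the default regex-based token pattern.
tf = CountVectorizer(tokenizer=tokenize, token_pattern=None)
X = tf.fit_transform(docs)  # sparse document-term count matrix

# Terms listed in the same order as the columns of X.
vocab = sorted(tf.vocabulary_, key=tf.vocabulary_.get)

# Two topics is an arbitrary choice for this toy data.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one row of topic weights per document
```

Each coded term such as 1-2252 ends up as a single column in X, so the LDA model sees whole codes rather than fragments split at the hyphen.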