How to loop the parameter for ngrams inside countvectorizer?


I want to try six different n-gram combinations:

  1. unigram (1,1)
  2. bigram (2,2)
  3. trigram (3,3)
  4. unigram + bigram (1,2)
  5. bigram + trigram (2,3)
  6. unigram + bigram + trigram (1,3)

Is it possible to use a for loop, or some other approach, to go through all the combinations instead of changing the parameter by hand each time?

    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=no_tokenizer, lowercase=False, binary=True, ngram_range=(1,1))),
        ('clf', SGDClassifier(loss='log', penalty='l2', max_iter=20, verbose=0))
    ])
    pipeline.fit(train.X, train.y)
    preds = pipeline.predict(dev.X)
    print(metrics.classification_report(dev.y, preds))

I would also like to get the output of print(metrics.classification_report(dev.y, preds)) for each of the 6 combinations.


Solution

  • I think the cleanest way is to use GridSearchCV with a "param_grid", but this requires you to choose a specific scoring function. A parameter of a pipeline step is addressed as <step name>__<parameter name>, so the vectorizer's n-gram setting is vect__ngram_range. This syntax is described at https://scikit-learn.org/stable/modules/compose.html under "5.1.1.1.3. Nested parameters".

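    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    
    
    # no_tokenizer, train and dev are the objects from the question.
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=no_tokenizer, lowercase=False, binary=True)),
        # loss='log' was renamed to 'log_loss' in scikit-learn 1.1; use 'log' on older versions.
        ('clf', SGDClassifier(loss='log_loss', penalty='l2', max_iter=20, verbose=0))
    ])
    
    # The CountVectorizer parameter is ngram_range, so the nested key is vect__ngram_range.
    param_grid = {'vect__ngram_range': [(1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (1, 3)]}
    # cv must be at least 2 (cv=1 raises an error); scoring='f1' assumes binary labels.
    grid_search = GridSearchCV(pipeline, cv=5, param_grid=param_grid, scoring='f1')
    
    # Cross-validates every ngram_range on the training data, then refits the best one.
    grid_search.fit(train.X, train.y)
    print(grid_search.best_params_)
    print(grid_search.score(dev.X, dev.y))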
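
    A grid search also keeps the cross-validated score of every candidate in its cv_results_ attribute, which gives one number per n-gram range rather than only the best one. The following is a minimal sketch under stated assumptions: the toy corpus and labels are made up for illustration, the texts are plain strings handled by the default tokenizer (not the no_tokenizer from the question), and LogisticRegression stands in for the SGDClassifier so that the example does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = positive, 0 = negative), for illustration only.
texts = [
    "a really good movie", "such a good film",
    "really great acting here", "a great and good story",
    "a really bad movie", "such a bad film",
    "really awful acting here", "an awful and bad story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer(binary=True)),
    ("clf", LogisticRegression()),  # stand-in classifier, not the one from the question
])

ngram_ranges = [(1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (1, 3)]
grid_search = GridSearchCV(
    pipeline,
    param_grid={"vect__ngram_range": ngram_ranges},
    cv=2,           # tiny data set, so only two folds
    scoring="f1",   # binary labels 0/1
)
grid_search.fit(texts, labels)

# cv_results_["params"] and cv_results_["mean_test_score"] are aligned:
# entry i of each describes the same candidate.
results = grid_search.cv_results_
for params, mean_f1 in zip(results["params"], results["mean_test_score"]):
    print(params["vect__ngram_range"], mean_f1)

print("best:", grid_search.best_params_)
```

    Note that these scores come from cross-validation on the training data, not from the separate dev set.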
    

    If you really want a full classification report for each ngram_range, you could do the following:

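    from sklearn import metrics
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    
    
    # no_tokenizer, train and dev are the objects from the question.
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=no_tokenizer, lowercase=False, binary=True)),
        # loss='log' was renamed to 'log_loss' in scikit-learn 1.1; use 'log' on older versions.
        ('clf', SGDClassifier(loss='log_loss', penalty='l2', max_iter=20, verbose=0))
    ])
    
    for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (1, 3)]:
        # Update the vectorizer in place, then refit the whole pipeline.
        pipeline.set_params(vect__ngram_range=ngram_range)
        pipeline.fit(train.X, train.y)
        preds = pipeline.predict(dev.X)
        print('ngram_range =', ngram_range)
        print(metrics.classification_report(dev.y, preds))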
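
    If the reports need to be compared afterwards rather than just printed, one option is to build the list of ranges with itertools and store each report as a dictionary. This is a rough sketch, not a definitive implementation: the train/dev texts and labels are invented for illustration, the default tokenizer is used instead of no_tokenizer, and LogisticRegression is swapped in for the SGDClassifier.

```python
from itertools import combinations_with_replacement

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = positive, 0 = negative), for illustration only.
train_texts = [
    "a really good movie", "such a good film",
    "really great acting here", "a great and good story",
    "a really bad movie", "such a bad film",
    "really awful acting here", "an awful and bad story",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
dev_texts = ["really good acting here", "such an awful movie"]
dev_labels = [1, 0]

# All (min_n, max_n) pairs with 1 <= min_n <= max_n <= 3:
# (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)
ngram_ranges = list(combinations_with_replacement(range(1, 4), 2))

pipeline = Pipeline([
    ("vect", CountVectorizer(binary=True)),
    ("clf", LogisticRegression()),  # stand-in classifier, not the one from the question
])

reports = {}
for ngram_range in ngram_ranges:
    pipeline.set_params(vect__ngram_range=ngram_range)
    pipeline.fit(train_texts, train_labels)
    preds = pipeline.predict(dev_texts)
    # output_dict=True returns nested dicts instead of a formatted string;
    # zero_division=0 avoids warnings when a class is never predicted.
    reports[ngram_range] = classification_report(
        dev_labels, preds, output_dict=True, zero_division=0
    )

# Print one summary line per n-gram range.
for ngram_range, report in reports.items():
    print(ngram_range, "macro F1:", round(report["macro avg"]["f1-score"], 3))
```

    The full text report is still available by calling classification_report without output_dict=True inside the loop.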