Tags: python, scikit-learn, nlp, countvectorizer

How are "word boundaries" identified in Python sklearn CountVectorizer's analyzer parameter?


Python sklearn's CountVectorizer has an "analyzer" parameter with a "char_wb" option. According to the documentation:

"Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space."

My question is: how does CountVectorizer identify a "word" in a string? More specifically, are "words" simply whitespace-separated substrings of a sentence, or are they identified by more complex techniques like nltk's word_tokenize?

The reason I ask is that I am analyzing social media data, which contains a whole lot of @mentions and #hashtags. Now, nltk's word_tokenize breaks up a "@mention" into ["@", "mention"] and a "#hashtag" into ["#", "hashtag"]. If I feed these tokens into CountVectorizer with ngram_range > 1, the "#" and "@" will never be captured as features. Moreover, I want the character n-grams (with char_wb) to capture "@m" and "#h" as features, which can never happen if CountVectorizer breaks up @mentions and #hashtags into ["@", "mention"] and ["#", "hashtag"].
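For illustration, here is a quick comparison of the two splitting behaviors (a sketch; the exact word_tokenize output may vary slightly across nltk versions):

    from nltk.tokenize import word_tokenize  # may require nltk.download('punkt') or 'punkt_tab'

    print(word_tokenize("@mention #hashtag"))
    # typically ['@', 'mention', '#', 'hashtag'] -- the symbols are split off
    print("@mention #hashtag".split())
    # ['@mention', '#hashtag'] -- plain whitespace splitting keeps them attached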

What do I do?


Solution

  • It separates words by whitespace, as you can see in the source code.

    def _char_wb_ngrams(self, text_document):
        """Whitespace sensitive char-n-gram tokenization.
        Tokenize text_document into a sequence of character n-grams
        operating only inside word boundaries. n-grams at the edges
        of words are padded with space."""
        # normalize white spaces
        text_document = self._white_spaces.sub(" ", text_document)
    
        min_n, max_n = self.ngram_range
        ngrams = []
    
        # bind method outside of loop to reduce overhead
        ngrams_append = ngrams.append
    
    
        for w in text_document.split():
            w = ' ' + w + ' '
            w_len = len(w)
            for n in range(min_n, max_n + 1):
                offset = 0
                ngrams_append(w[offset:offset + n])
                while offset + n < w_len:
                    offset += 1
                    ngrams_append(w[offset:offset + n])
                if offset == 0:   # count a short word (w_len < n) only once
                    break
        return ngrams
    

    text_document.split(), called with no arguments, splits on any run of whitespace. That means "@mention" and "#hashtag" each survive as a single "word", so char_wb n-grams such as "@m" and "#h" will be produced.
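
    You can confirm this by building the analyzer yourself and feeding it a sample string (the text below is just an illustrative sketch):

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy example; the sample sentence is made up for illustration.
    vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
    analyze = vectorizer.build_analyzer()   # callable producing char_wb n-grams

    grams = analyze("check out @mention and #hashtag")
    print('@m' in grams, '#h' in grams)     # True True
    print(' @' in grams, ' #' in grams)     # True True (space-padded edge grams)

    Because "@mention" and "#hashtag" are kept intact by the whitespace split, their padded character n-grams include "@m", "#h", " @" and " #", which is exactly what you want to capture.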