python · machine-learning · scikit-learn · nlp · kaggle

TfIdfVectorizer not tokenizing properly


As far as I can tell, there is no existing question like this one. I'm working on an NLP and sentiment analysis project on Kaggle, and first of all I'm preparing my data. The dataframe has a text column followed by a number from 0 to 9 that indicates which cluster the row (the document) belongs to. I'm using the TF-IDF vectorizer from sklearn. I want to get rid of anything that's not an English word, so I'm using the following:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

s_words = list(nltk.corpus.stopwords.words("english"))

c = TfidfVectorizer(sublinear_tf=False,
                    stop_words=s_words,
                    token_pattern=r"(?ui)\b\w*[a-z]+\w*\b",
                    tokenizer=LemmaTokenizer(),
                    analyzer="word",
                    strip_accents="unicode")

# a_df is the original dataframe
X = a_df['Text']
X_text = c.fit_transform(X)

As far as I know, calling c.get_feature_names() should then return only the tokens that are proper words, without numbers or punctuation symbols. I found the regex in a StackOverflow post, but using a simpler one like [a-zA-Z]+ does exactly the same (that is, nothing). When I call the feature names, I get things like

["''abalone",
"#",
"?",
"$",
"'",
"'0",
"'01",
"'accidentally",
...]

Those are just a few examples, but they are representative of the output I get instead of just words. I've been stuck on this for days, trying different regular expressions and methods. I even hardcoded some of the offending features into the stop-words list. I'm asking because later I use LDA to get the topics of each cluster, and I end up with punctuation symbols as the "topics". I hope I'm not duplicating another post. I'll gladly provide any additional information needed. Thank you in advance!


Solution

  • The regex pattern gets ignored if you pass a custom tokenizer. This is not mentioned in the documentation, but you can see it clearly in the source code here:

    https://github.com/scikit-learn/scikit-learn/blob/9e5819aa413ce907134ee5704abba43ad8a61827/sklearn/feature_extraction/text.py#L333

    def build_tokenizer(self):
        """Return a function that splits a string into a sequence of tokens.
        Returns
        -------
        tokenizer: callable
              A function to split a string into a sequence of tokens.
        """
        if self.tokenizer is not None:
            return self.tokenizer
        token_pattern = re.compile(self.token_pattern)
        return token_pattern.findall
    

    If self.tokenizer is not None, the token pattern is never used.

    Solving this is straightforward: just apply the regex token pattern inside your custom tokenizer and use it to select tokens, as in the sketch below.
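
    For example, a minimal sketch (reusing the regex and lemmatizer from the question; the exact filtering rule, keeping only tokens made of word characters that contain at least one letter, is an assumption about what you want):

    import re
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    class LemmaTokenizer(object):
        # Reuse the same pattern the vectorizer would otherwise have applied.
        token_pattern = re.compile(r"(?ui)\b\w*[a-z]+\w*\b")

        def __init__(self):
            self.wnl = WordNetLemmatizer()

        def __call__(self, doc):
            # Lemmatize every token, then keep only the ones that look like
            # words (drops punctuation-only and purely numeric tokens).
            return [self.wnl.lemmatize(t) for t in word_tokenize(doc)
                    if self.token_pattern.fullmatch(t)]

    With this in place you can drop the token_pattern argument from the TfidfVectorizer call, or leave it there; it is ignored either way once a custom tokenizer is passed.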