Tags: python, regex, scikit-learn, tokenize, tfidfvectorizer

How to make sklearn.TfidfVectorizer tokenize special phrases?


I am trying to create a tf-idf table using TfidfVectorizer from the sklearn package in Python. For example, I have a corpus of one string: "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"

TfidfVectorizer has a token_pattern argument that controls what a token should look like. The default is token_pattern=r'(?u)\b\w\w+\b', which matches runs of two or more word characters, so punctuation acts as a split point and single-character pieces are dropped. On the corpus above it generates tokens like below

["pd", "expression", "positive","and" ,"negative" ,"for" ,"actionable" ,"molecular" ",markers"]

But what I would like to have is:

["pd-l1", "expression", "positive", "≥1%–49%","and" ,"negative" ,"for" ,"actionable" "molecular" ,"markers"]

I was tweaking the token_pattern argument for hours but cannot get it right. Alternatively, is there a way to tell the vectorizer explicitly that I want to have pd-l1 and ≥1%–49% as tokens, without going too wild on regex? Any help is very appreciated!


Solution

  • I got it working with the pattern '[^ ()]+', i.e. any run of characters other than space, ( and ).

    You may need to add more punctuation characters to this list.
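
    For example, a quick check with re (an illustrative variant; the extra characters in the class and the test sentence are my own assumptions about what punctuation might appear):

    import re

    # the answer's pattern, extended to also exclude commas, periods and semicolons
    pattern = r'[^ (),.;]+'
    print(re.findall(pattern, "PD-L1 expression positive (≥1%–49%), and negative."))
    # ['PD-L1', 'expression', 'positive', '≥1%–49%', 'and', 'negative']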

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"
    ]

    vectorizer = TfidfVectorizer()
    print('token_pattern:', vectorizer.token_pattern)  # default: (?u)\b\w\w+\b

    # override the pattern; it is only read when the vectorizer is fitted
    vectorizer.token_pattern = '[^ ()]+'
    print('token_pattern:', vectorizer.token_pattern)

    X = vectorizer.fit_transform(corpus)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(list(vectorizer.get_feature_names_out()))
    

    Result

    ['actionable', 'and', 'expression', 'for', 'markers', 'molecular', 'negative', 'pd-l1', 'positive', '≥1%–49%']
    

    I used the example code from the TfidfVectorizer documentation.
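
    To inspect the resulting tf-idf table itself (an extra step, not part of the original answer; uses pandas), continuing from the code above:

    import pandas as pd

    # one row per document, one column per token
    df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
    print(df)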


    EDIT:

    I checked the documentation and found that I can set it directly in the constructor:

    vectorizer = TfidfVectorizer(token_pattern='[^ ()]+')
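
    A further alternative (my own note, not from the original answer): if the regex gets unwieldy, TfidfVectorizer also accepts a tokenizer callable, which replaces token_pattern entirely:

    from sklearn.feature_extraction.text import TfidfVectorizer

    def simple_tokenizer(doc):
        # drop parentheses, then split on whitespace
        return doc.replace('(', ' ').replace(')', ' ').split()

    corpus = [
        "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"
    ]

    # token_pattern=None silences the warning that the pattern is unused
    vectorizer = TfidfVectorizer(tokenizer=simple_tokenizer, token_pattern=None)
    X = vectorizer.fit_transform(corpus)
    print(list(vectorizer.get_feature_names_out()))
    # ['actionable', 'and', 'expression', 'for', 'markers', 'molecular',
    #  'negative', 'pd-l1', 'positive', '≥1%–49%']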