python regex scikit-learn tokenize tfidfvectorizer

How to make sklearn.TfidfVectorizer tokenize special phrases?

I am trying to build a tf-idf table using TfidfVectorizer from the sklearn package in Python. For example, I have a corpus of one string: "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"

TfidfVectorizer has a token_pattern argument that defines what counts as a token. The default is token_pattern='(?u)\b\w\w+\b', which keeps only runs of two or more word characters (letters, digits, underscore), so hyphens, percent signs and other special characters act as separators and single-character tokens are dropped. That generates tokens like the ones below:

["pd", "l1", "expression", "positive", "49", "and", "negative", "for", "actionable", "molecular", "markers"]
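
A quick way to check what the default pattern keeps is to run it outside the vectorizer with re.findall. This is only a sketch: it assumes the vectorizer lowercases the text first and then applies the pattern with findall, and the variable names are made up for illustration.

```python
import re

text = "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"

# Default token_pattern of TfidfVectorizer: runs of 2+ word characters.
default_pattern = r"(?u)\b\w\w+\b"

# Lowercase first, mirroring the vectorizer's default lowercase=True.
tokens = re.findall(default_pattern, text.lower())
print(tokens)
```

Here "pd-l1" is split at the hyphen into "pd" and "l1", the "49" survives on its own, and the lone "1" is dropped because it is shorter than two characters.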

What I would like to get instead is:

["pd-l1", "expression", "positive", "≥1%–49%", "and", "negative", "for", "actionable", "molecular", "markers"]

I have been tweaking the token_pattern argument for hours but cannot get it right. Alternatively, is there a way to tell the vectorizer explicitly that I want pd-l1 and ≥1%–49% as tokens without going too wild on regex? Any help is much appreciated!

Solution

I got it working with the pattern '[^ ()]+', which matches runs of any characters except a space, ( and ).

You may need to add other punctuation characters to that excluded set.
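
As a rough illustration of extending the excluded set, the sketch below compares the original pattern with a variant that also excludes all whitespace and a few punctuation marks. The extra characters (, . ; :) and the sample sentence are my own assumptions, not something the question requires, and plain re.findall stands in for the vectorizer here.

```python
import re

# Sample sentence with a trailing comma and period added for illustration.
text = "PD-L1 expression positive (≥1%–49%), and negative for markers."

# Original pattern: everything except spaces and parentheses.
basic = re.compile(r"[^ ()]+")

# Assumed variant: also exclude any whitespace plus , . ; :
extended = re.compile(r"[^\s(),.;:]+")

basic_tokens = basic.findall(text.lower())
extended_tokens = extended.findall(text.lower())

print(basic_tokens)     # punctuation leaks in: a lone ',' and 'markers.'
print(extended_tokens)  # punctuation is gone, 'pd-l1' and '≥1%–49%' stay whole
```

One trade-off to keep in mind: excluding '.' would also split decimal numbers such as 1.5 into two tokens, so the set should be chosen with the actual corpus in mind.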

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
 "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"
]

vectorizer = TfidfVectorizer()
print('token_pattern:', vectorizer.token_pattern)

vectorizer.token_pattern = '[^ ()]+'
print('token_pattern:', vectorizer.token_pattern)

X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

Result

['actionable', 'and', 'expression', 'for', 'markers', 'molecular', 'negative', 'pd-l1', 'positive', '≥1%–49%']

I used the example code from the TfidfVectorizer documentation.

EDIT:

I checked the documentation, and the pattern can also be passed directly to the constructor:

vectorizer = TfidfVectorizer(token_pattern='[^ ()]+')
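
If you would rather avoid regex altogether, another option might be a custom tokenizer function. The sketch below is only one possible approach: the function name tokenize and the set of characters it trims are hypothetical choices for this example.

```python
def tokenize(text):
    """Split on whitespace, then trim wrapping punctuation from each piece."""
    tokens = []
    for piece in text.split():
        # Remove brackets and punctuation only at the start and end,
        # so inner characters like '-' and '%' are preserved.
        piece = piece.strip("()[],.;:")
        if piece:
            tokens.append(piece)
    return tokens


doc = "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"
result = tokenize(doc.lower())
print(result)
```

A function like this can be handed to the vectorizer through its tokenizer parameter, e.g. TfidfVectorizer(tokenizer=tokenize), in which case token_pattern is not used. Because it trims only the ends of each piece, a trailing "%" or a hyphen inside a term is kept intact.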