I am trying to create a tf-idf table using TfidfVectorizer from the sklearn package in Python. For example, I have a corpus of one string:
"PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"
TfidfVectorizer has a token_pattern argument that specifies what a token should look like.
The default is token_pattern='(?u)\b\w\w+\b', which matches runs of two or more word characters, so hyphens, parentheses and other special characters are dropped, and it generates tokens like the ones below:
["pd", "expression", "positive", "and", "negative", "for", "actionable", "molecular", "markers"]
What I would like to get instead is:
["pd-l1", "expression", "positive", "≥1%–49%", "and", "negative", "for", "actionable", "molecular", "markers"]
I have been tweaking the token_pattern argument for hours but cannot get it right. Alternatively, is there a way to tell the vectorizer explicitly that I want pd-l1 and ≥1%–49% as tokens without going too wild on regex? Any help is much appreciated!
I got it using the pattern '[^ ()]+', which matches runs of any characters except space, ( and ). You may need to add punctuation characters to this list.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"
]
vectorizer = TfidfVectorizer()
print('token_pattern:', vectorizer.token_pattern)
vectorizer.token_pattern = '[^ ()]+'
print('token_pattern:', vectorizer.token_pattern)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
Result:
['actionable', 'and', 'expression', 'for', 'markers', 'molecular', 'negative', 'pd-l1', 'positive', '≥1%–49%']
I used the example code from the TfidfVectorizer documentation.
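To check what a pattern extracts without fitting a vectorizer, here is a minimal sketch that applies both patterns with re.findall. It assumes the word analyzer lowercases the text and then runs findall with token_pattern, which is my understanding of the default behaviour. The helper name tokens is made up for illustration.

```python
import re

text = "PD-L1 expression positive (≥1%–49%) and negative for actionable molecular markers"

def tokens(pattern, doc):
    # Assumption: mimic the vectorizer's word analyzer by lowercasing first,
    # then collecting every non-overlapping match of the pattern.
    return re.findall(pattern, doc.lower())

default_tokens = tokens(r"(?u)\b\w\w+\b", text)  # scikit-learn's default pattern
custom_tokens = tokens(r"[^ ()]+", text)         # pattern from this answer

print(default_tokens)
print(custom_tokens)
```

With the default pattern, pd-l1 is split into pd and l1 and only 49 survives from the percentage range. With '[^ ()]+' both stay whole.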
EDIT:
I checked the documentation, and the pattern can be set directly in the constructor:
vectorizer = TfidfVectorizer(token_pattern='[^ ()]+')
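As noted above, more punctuation may need to be excluded. Here is one possible sketch of such an extension, not a definitive pattern. The extra excluded characters (comma, semicolon, colon, any whitespace) and the sample sentence are my own assumptions for illustration.

```python
import re

# Hypothetical extension of '[^ ()]+': also treat commas, semicolons, colons
# and any whitespace as separators. Periods are deliberately kept so that
# decimals such as "0.5" stay in one piece.
extended_pattern = r"[^\s(),;:]+"

# Made-up sample text, only to exercise the extra separators.
sample = "PD-L1 high (≥50%), EGFR: negative; ALK negative"

extended_tokens = re.findall(extended_pattern, sample.lower())
print(extended_tokens)
```

The same string can then be passed as token_pattern to TfidfVectorizer. The trade-off is that a sentence-final period stays attached to the last word, so you may want to exclude it too if your text has no decimals.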