Tags: python-3.x, scikit-learn, jupyter-notebook, svm, tfidfvectorizer

Extract feature importance of n-grams from a TfidfVectorizer in an SVC(kernel='linear') model


I'm wondering why my output differs from what I expect. It's as if my program ignores both the sorted() call and feature_names. Sorting coef_ is crucial for finding out which features contribute most to the predictions. Calling vectorizer.get_feature_names on its own returns the individual words, but not when I use it inside a loop or a function definition. Does anybody know what could be happening, or know another way to extract n-gram feature weights and their names from an SVC with kernel='linear'?

My code:

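import pandas as pd
from IPython.display import display
from nltk.tokenize import word_tokenize  # assumed: word_tokenize comes from NLTK
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# ItemSelector is not shipped with scikit-learn; it is assumed to be a
# user-defined transformer that selects the DataFrame column named by `key`.

## load data features with removed columns based on numeric feature selection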
df = pd.read_csv('preprocessed_data_all.csv', usecols=['normalized_fixed', 'TAG', 'DEP', 'level1', 'avg_wordlength',
 'lexical_variety',
 'avg_sentlength',
 'VBD_rel_cnt',
 'VBN_rel_cnt',
 'VBG_rel_cnt',
 'MD_rel_cnt',
 'np_rel_cnt',
 'clause_rel_cnt',
 'clause_rel_word_cnt'])

df = df.sample(n=2000)

# define X and y 
X = df.drop('level1', axis=1)
y = df.level1.values

## create pipeline for word unigrams
bow_pipe = Pipeline([
    ("text", ItemSelector(key="normalized_fixed")),
    ("bow_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

## create pipeline for pos tags
pos_pipe = Pipeline([
    ("pos", ItemSelector(key="TAG")),
    ("pos_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

## create pipeline for dependency tags
dep_pipe = Pipeline([
    ("dep", ItemSelector(key="DEP")),
    ("dep_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

# define classifier
svm = SVC(kernel='linear', class_weight='balanced')

# define pipeline for most important unigram extraction
pipe = Pipeline([
    ("feats", FeatureUnion([
        ("bow", bow_pipe),
        ("tag", pos_pipe),
        ("dep", dep_pipe)
    ])),
    ("clf", svm)
])

pipe.fit(X, y)

# display feature importance of BOW model
levels = df['level1'].unique()

for level in levels: 
    
    featuredf = pd.DataFrame()

    labelid = list(pipe.named_steps['clf'].classes_).index(level)
    feature_names = pipe.named_steps['feats'].transformer_list[0][1].named_steps['bow_vec'].get_feature_names()
    topn = sorted(zip(pipe.named_steps['clf'].coef_[labelid], feature_names))[-10:]

    for coef, feat in topn:
        featuredf = featuredf.append(pd.Series([level, feat, coef]), ignore_index = True)

    display(featuredf)

My output:

    0   1   2
0   A1  !   (0, 1834)\t-0.07826243560812945\n (0, 4347)\t-0.07826243560812945\n (0, 4760)\t-0.223132736239871\n (0, 5498)\t-0.07140284578344763\n (0, 6756)\t-0.16195282546411804\n (0, 8637)\t-0.06337764014791308\n (0, 8763)\t-0.07826243560812945\n (0, 9044)\t-0.08060172162144445\n (0, 901)\t-0.0026223432774063423\n (0, 5906)\t-0.16675468967573015\n (0, 6796)\t-0.04403627031278603\n (0, 8495)\t-0.2603055807883978\n (0, 8498)\t-0.17305812627971506\n (0, 8735)\t-0.34489400420874144\n (0, 9484)\t-0.11083343873432677\n (0, 2637)\t-0.18040783909656172\n (0, 2737)\t-0.5380874813828527\n (0, 3129)\t-0.013035612996414479\n (0, 3773)\t-0.08449907288128825\n (0, 4437)\t-0.013035612996414479\n (0, 4438)\t-0.026071225992828958\n (0, 5924)\t-0.013035612996414479\n (0, 7269)\t-0.3730438143689519\n (0, 7737)\t-0.705047869548585\n (0, 8722)\t-0.024098248030544236\n :\t:\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.08649110380251428\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t0.022121814035341927\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.3512603080433229\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t-0.005938550730073638\n (0, 8445)\t-0.06546829964399035\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t0.022781224878066095\n (0, 3739)\t0.03267210725715032
0   1   2
0   B1  !   (0, 353)\t-0.00449726057217602\n (0, 802)\t-0.05617787611978642\n (0, 973)\t-0.10543173735834135\n (0, 1847)\t-0.007780155148241354\n (0, 1989)\t-0.003934155442206846\n (0, 2017)\t-0.005086622578660749\n (0, 2204)\t-0.031113872051853505\n (0, 2405)\t-0.09318613349857544\n (0, 3024)\t-0.005086622578660749\n (0, 3283)\t-0.10089509076272042\n (0, 4556)\t-0.00449726057217602\n (0, 5175)\t-0.005086622578660749\n (0, 5454)\t-0.32011264216698354\n (0, 5724)\t-0.003934155442206846\n (0, 6015)\t-0.005086622578660749\n (0, 6330)\t-0.005086622578660749\n (0, 6473)\t-0.004194952284695256\n (0, 6534)\t-0.19221655261459114\n (0, 6582)\t-0.031591903060786936\n (0, 7980)\t-0.32174386411546047\n (0, 7992)\t-0.004825825736172337\n (0, 9514)\t-0.17326784128005032\n (0, 9556)\t-0.08135115057424913\n (0, 9654)\t-0.004194952284695256\n (0, 9746)\t-0.24722791235969363\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.018006244391458114\n (0, 2945)\t-0.025529859622348973\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.018006244391458114\n (0, 4711)\t-0.012262917681600417\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t-0.05074854733491452\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.003648836702785919\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.20173441895833755\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t-0.049447249347229494\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t-0.032859710207201465\n (0, 9630)\t0.022781224878066095
0   1   2
0   A2  !   (0, 1510)\t-0.047241319436499236\n (0, 4554)\t-0.09138323899895806\n (0, 5454)\t-0.0062230357565567634\n (0, 7756)\t-0.061785302573242856\n (0, 281)\t-0.01653184338009155\n (0, 351)\t-0.01653184338009155\n (0, 450)\t-0.3274464832370879\n (0, 809)\t-0.013387638815271769\n (0, 2051)\t-0.014616379782250303\n (0, 2586)\t-0.01653184338009155\n (0, 2741)\t-0.15810190062867993\n (0, 3224)\t-0.06225932224260644\n (0, 3373)\t-0.12280247879038902\n (0, 3421)\t-0.015684237235273946\n (0, 3819)\t-0.01653184338009155\n (0, 3833)\t-0.3359646619748352\n (0, 4068)\t-0.015684237235273946\n (0, 4402)\t-0.07152757844346042\n (0, 4649)\t-0.3279430542171356\n (0, 5524)\t-0.0899771265578215\n (0, 5790)\t-0.3885263430136202\n (0, 7822)\t-0.059872091526754725\n (0, 505)\t-0.0711692477199759\n (0, 5724)\t-0.16023961429736408\n (0, 6286)\t-0.049366239531379814\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.08531697506522545\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.007931536811222977\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t-0.10456248357681736\n (0, 6865)\t-0.03968381644376268\n (0, 6980)\t-0.11581955114710678\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t-0.28868175238220484\n (0, 7606)\t-0.03988088938576907\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t0.008341811597571965\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t-0.0659929953660953\n (0, 9630)\t-0.05573508827608763
0   1   2
0   B2  !   (0, 1604)\t-0.5452053299558446\n (0, 1611)\t-0.14349203210584277\n (0, 1786)\t-0.07926751381540288\n (0, 4402)\t-0.061638227000430465\n (0, 4469)\t-0.18047516283733558\n (0, 4483)\t-0.12632546444958545\n (0, 7467)\t-0.1657501793150448\n (0, 7953)\t-0.2027592110690899\n (0, 7991)\t-0.0705445978132748\n (0, 9157)\t-0.1576966747397613\n (0, 9746)\t-0.13158162095004766\n (0, 776)\t-0.07804759361864515\n (0, 1432)\t-0.04319046246215665\n (0, 1630)\t-0.06742619934269474\n (0, 1903)\t-0.03634857244837165\n (0, 2742)\t-0.04319046246215665\n (0, 2816)\t-0.15050859335152222\n (0, 3562)\t-0.03940488059869191\n (0, 4318)\t-0.04097603902861966\n (0, 4490)\t-0.04319046246215665\n (0, 5187)\t-0.27333877764907855\n (0, 5252)\t-0.04319046246215665\n (0, 5551)\t-0.22302927657831634\n (0, 5790)\t-0.18300512305356684\n (0, 5852)\t-0.029396346557071712\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.05047091757718261\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.09066783596741043\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.08208799126770042\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t-0.011737613363530505\n (0, 8634)\t-0.014588840340588008\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t-0.05074822281767788
0   1   2
0   C1  !   (0, 244)\t-0.09884319674162795\n (0, 690)\t-0.1388650034822139\n (0, 960)\t-0.10605470450461775\n (0, 1373)\t-0.29793485494660743\n (0, 1584)\t-0.15220560572907585\n (0, 1603)\t-0.15220560572907585\n (0, 1604)\t-0.2943386139167361\n (0, 1638)\t-0.15220560572907585\n (0, 2080)\t-0.1444018536776252\n (0, 2680)\t-0.22203398402397742\n (0, 2722)\t-0.1388650034822139\n (0, 2774)\t-0.15220560572907585\n (0, 2822)\t-0.13106125143076322\n (0, 3071)\t-0.1444018536776252\n (0, 3265)\t-0.2691405776324631\n (0, 3627)\t-0.1444018536776252\n (0, 4014)\t-0.15220560572907585\n (0, 4073)\t-0.15220560572907585\n (0, 4247)\t-0.15220560572907585\n (0, 4659)\t-0.3056381346962476\n (0, 4726)\t-0.3044112114581517\n (0, 4868)\t-0.15220560572907585\n (0, 5014)\t-0.1388650034822139\n (0, 5074)\t-0.1444018536776252\n (0, 5450)\t-0.1865505674300888\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t0.020274287275611008\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.09559883514855932\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.13840215323517818\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t0.02183231985676501\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t-0.09844846169130346
0   1   2
0   C2  !   (0, 1510)\t-0.05959414411701482\n (0, 1925)\t-0.07930619936349027\n (0, 4554)\t-0.12239420823695288\n (0, 6337)\t-0.10751794751817559\n (0, 6919)\t-0.11905736382524099\n (0, 7940)\t-0.4509514511406135\n (0, 8674)\t-0.06634760477509609\n (0, 8876)\t-0.09955165597499022\n (0, 281)\t-0.11414308129642091\n (0, 351)\t-0.11414308129642091\n (0, 450)\t-0.6510470682457199\n (0, 2051)\t-0.2941575097638065\n (0, 2586)\t-0.11414308129642091\n (0, 2741)\t-0.2826990863413035\n (0, 3224)\t-0.19484491340690077\n (0, 3421)\t-0.15593785912375202\n (0, 3819)\t-0.11414308129642091\n (0, 3833)\t-0.5959237864651308\n (0, 4068)\t-0.17821843998174858\n (0, 7822)\t-0.18737390753407893\n (0, 505)\t-0.17144684026449575\n (0, 1605)\t-0.1761432003933918\n (0, 2087)\t-0.30641937334729397\n (0, 2435)\t-0.01959392434397809\n (0, 2544)\t-0.04205087201608662\n :\t:\n (0, 1543)\t-0.10270397939145476\n (0, 4091)\t0.03510805887202453\n (0, 4621)\t0.03774871894954156\n (0, 5548)\t0.11591078340050759\n (0, 7216)\t0.10996790758187949\n (0, 8462)\t0.11591078340050759\n (0, 6902)\t0.0953038712770817\n (0, 275)\t0.05201608398317632\n (0, 2309)\t-0.11518506286788033\n (0, 5602)\t0.130229016246211\n (0, 6856)\t0.028341275217697887\n (0, 9697)\t0.130229016246211\n (0, 5898)\t0.11137290054484401\n (0, 5921)\t0.09019079917603834\n (0, 6930)\t-0.41244426698962444\n (0, 7468)\t0.11137290054484401\n (0, 8776)\t0.08870699446777193\n (0, 981)\t0.1483908905002444\n (0, 2084)\t0.1483908905002444\n (0, 3159)\t0.02642985841317767\n (0, 3306)\t-0.02982547695328551\n (0, 3508)\t0.055716356911763666\n (0, 8305)\t0.025416449215215475\n (0, 8431)\t0.02642985841317767\n (0, 8682)\t0.027858178455881833

What the output format is supposed to be:

bs obećao -4.50534985071
bs pošto -4.50534985071
bs prava -4.50534985071
bs predstavlja -4.50534985071
bs prošlosedmičnom -4.50534985071
bs sjeveru -4.50534985071
bs taj -4.50534985071
bs vladavine -4.50534985071
bs će -4.50534985071
bs da -4.0998847426

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

This is based on the second answer to "How to get most informative features for scikit-learn classifier for different class?". The last comment on that answer describes exactly the same problem:

Amazing @alvas I tried the above function but the output looks like this:POS aaeguno móvil (0, 60) -0.0375375709849 (0, 300) -0.0375375709849 (0, 3279) -0.0375375709849 instead of returning the class, followed by the word and the float. Any idea of why this is happening?. Thanks! – newWithPython Mar 15 '15 at 0:45

But no one has replied to this, and since I have a very low reputation, I cannot request more information there either.

I've spent a week on this and really cannot afford to spend much longer. It's the last piece of the puzzle that is my thesis, which is not going to be perfect, but I just need to get it done and graduate. Any help would be greatly appreciated! Also, let me know what I could add to make this question clearer; this is only my second or third question on this platform.


Solution

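  • As it turns out, using sklearn's LinearSVC() produces the right output, so SVC(kernel='linear') requires other means of n-gram importance extraction. The likely reason is that an SVC fitted on sparse tf-idf input returns coef_ as a sparse matrix with one row per pair of classes (one-vs-one), whereas LinearSVC returns a dense array with one row per class. I just switched to LinearSVC, as it improved my model in general anyway.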
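The difference described in the solution can be reproduced on a small scale. The following is a minimal sketch, not the original thesis code: the toy corpus, the labels, and the helper top_features are hypothetical and exist only for illustration. It assumes a multiclass problem (more than two classes), where LinearSVC.coef_ has one row per class, and it shows one possible way to make the SVC(kernel='linear') coefficients usable by converting them to a dense array and labelling each row with its class pair.

```python
from itertools import combinations

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC

# Hypothetical toy corpus with four classes (illustration only).
docs = [
    "i like my cat", "i like my dog",
    "we visited the museum yesterday", "yesterday we visited friends",
    "the committee postponed the decision", "the decision was postponed however",
    "notwithstanding the aforementioned objections",
    "the aforementioned objections notwithstanding",
]
labels = ["A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse tf-idf matrix, as in the question

# get_feature_names() was replaced by get_feature_names_out() in newer
# scikit-learn releases; fall back to the old name if needed.
if hasattr(vec, "get_feature_names_out"):
    feature_names = np.asarray(vec.get_feature_names_out())
else:
    feature_names = np.asarray(vec.get_feature_names())

svc = SVC(kernel="linear").fit(X, labels)
lin = LinearSVC(random_state=0).fit(X, labels)


def top_features(clf, names, n=3):
    """Return (label, feature, weight) rows for a dense, one-row-per-class coef_."""
    rows = []
    for class_idx, label in enumerate(clf.classes_):
        weights = clf.coef_[class_idx]
        for i in np.argsort(weights)[-n:][::-1]:  # n largest weights, descending
            rows.append((label, str(names[i]), float(weights[i])))
    return rows


# LinearSVC: dense coef_ with one row per class, so class indexing works.
rows = top_features(lin, feature_names)
for label, feat, weight in rows:
    print(label, feat, weight)

# SVC(kernel='linear') on sparse input: coef_ is a sparse matrix with one row
# per class pair (one-vs-one), so coef_[class_idx] is not a per-class weight vector.
svc_coef_was_sparse = sparse.issparse(svc.coef_)
pair_weights = svc.coef_.toarray() if svc_coef_was_sparse else np.asarray(svc.coef_)
class_pairs = list(combinations(svc.classes_, 2))  # order: 0 vs 1, 0 vs 2, ...

for (first, second), weights in zip(class_pairs, pair_weights):
    best = int(np.argmax(weights))
    print(f"{first} vs {second}:", feature_names[best], float(weights[best]))
```

In the question's loop, zip(coef_[labelid], feature_names) iterates over a single-row sparse matrix, so it yields just one pair: the whole sparse row together with the first feature name. That matches the one-row output with "!" shown above. Note also that with a FeatureUnion, coef_ spans the features of all three vectorizers, so names from the bow vectorizer only cover the first block of columns.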