python, machine-learning, svm, data-analysis

Different number of features in SVC.coef_ and in the samples


I downloaded the data.

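from sklearn import datasets
from sklearn.feature_extraction.text import TfidfVectorizer

news = datasets.fetch_20newsgroups(subset='all', categories=['alt.atheism', 'sci.space'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(news.data)  # sparse tf-idf matrix, one row per post
y = news.target
print(X.shape)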

The shape of X is (1786, 28382)

Next I trained the model and got the coef_ shape

from sklearn import svm

clf = svm.SVC(kernel='linear', random_state=241, C=1e-5)
clf.fit(X, y)
data = clf.coef_[0].data  # internal storage of the first row of coef_
print(data.shape)

The shape is (27189,)

Why is the number of features different?


Solution

  • In short, nothing is wrong: the weight vector is clf.coef_, and it has the expected shape, (1, 28382), with one weight per feature. Because X is a sparse matrix, coef_ comes back as a scipy sparse matrix rather than a regular numpy array. The .data attribute you printed is that sparse matrix's internal storage: it holds only the explicitly stored (typically nonzero) entries, so its length is the number of such entries (27189 here), not the number of features. Do not use .data to read out the weights; it is exposed for low-level methods. Index clf.coef_ directly, or convert it with clf.coef_.toarray() if you need ordinary numpy operations.
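
To make the difference between the shape of coef_ and the length of its .data concrete, here is a minimal sketch. The four-document corpus, its labels, and C=1.0 are made up for illustration and stand in for the 20 newsgroups data, so the numbers it prints will not match the ones in the question.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus, used only so the example runs without a download.
docs = [
    "the rocket reached orbit",
    "the shuttle launch was delayed",
    "god and religion debate",
    "the atheism debate continues",
]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one column per term
clf = SVC(kernel="linear", C=1.0).fit(X, labels)

coef = clf.coef_
print(sparse.issparse(coef))  # sparse, because the training data was sparse
print(coef.shape)             # (1, n_features): one weight per feature
print(coef[0].data.shape)     # number of explicitly stored entries only

# Dense 1-D weight vector, aligned with the columns of X.
dense = coef.toarray().ravel()
print(dense.shape)
```

The dense vector always has one entry per column of X, while the length of .data depends on how many weights happen to be stored, which is why the two numbers in the question disagree.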