I'm trying to use class weights in a scikit-learn SVM classifier with RandomizedSearchCV.
import scipy.stats
from sklearn import svm
from sklearn.model_selection import RandomizedSearchCV

clf = svm.SVC(probability=True, random_state=0)
parameters = {'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
              'kernel': ['rbf'], 'class_weight': ['balanced', None]}
search = RandomizedSearchCV(estimator=clf, param_distributions=parameters, scoring='f1_micro',
                            cv=5, n_iter=100, random_state=0)
search.fit(features, labels)
I have 4 classes. Now for the class_weight I would like to have random values between 0 and 1 for each of the four classes. It could be done with
'class_weight':[{0: w} for w in [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]]
But this only covers one class, and the values come from a discrete set rather than being sampled continuously between 0 and 1.
How can I solve this?
Last but not least, does it matter if I'm using values between 0 and 1 or between 1 and 10 (i.e. are the weights rescaled)?
And should the weights of all 4 classes always sum to the same value (e.g. 1)?
I am not aware of any way to pass distributions as the values inside the class_weight dictionary. As an improvement on the workaround you came up with, you could use:
import numpy as np
from scipy.stats import lognorm
from sklearn.utils.class_weight import compute_class_weight

# y holds the class labels ("labels" in your snippet)
class_weight = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

class_weights = []
for mltp in lognorm(s=1, loc=1, scale=class_weight[0]).rvs(50):
    class_weights.append(dict(zip([0, 1], class_weight * [mltp, 1 / mltp])))
Then you can pass class_weights as the class_weight entry in parameters for RandomizedSearchCV, as shown in the sketch below. Extending this to a multi-class scenario or using different distributions is straightforward.
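For example, a minimal sketch that reuses the names from your snippet (clf, features, labels) and the class_weights list built above:

# The pre-sampled list of weight dicts replaces ['balanced', None];
# RandomizedSearchCV picks one dict from the list at each iteration.
parameters = {'C': scipy.stats.expon(scale=100),
              'gamma': scipy.stats.expon(scale=.1),
              'kernel': ['rbf'],
              'class_weight': class_weights}
search = RandomizedSearchCV(estimator=clf, param_distributions=parameters,
                            scoring='f1_micro', cv=5, n_iter=100, random_state=0)
search.fit(features, labels)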
Note that you in fact sample twice: once from the true distribution, and then again via RandomizedSearchCV from that sample. If you regenerate class_weights before each call to fit, or make the initial sample large enough, this workaround should work well in your case.
EDIT:
A better solution would be to define your own class implementing an rvs method. You can do this even without subclassing an existing scipy.stats distribution:
import numpy as np
from scipy.stats import gamma
from sklearn.utils.class_weight import compute_class_weight

class ClassWeights(object):
    """
    Draw random variates for cases when the parameter is a dict.
    Should be personalized as needed.
    """
    def __init__(self, y, *args, **kwargs):
        self.class_weights = compute_class_weight(class_weight="balanced",
                                                  classes=np.unique(y), y=y)
        self._make_dists()

    def _make_dists(self):
        # One gamma distribution per class, with mean equal to the "balanced" weight
        self.dist0 = gamma(self.class_weights[0])
        self.dist1 = gamma(self.class_weights[1])

    def rvs(self, *args, **kwargs):
        """Override method for drawing random variates."""
        ret_val = {0: self.dist0.rvs(*args, **kwargs),
                   1: self.dist1.rvs(*args, **kwargs)}
        return ret_val
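An instance of this class can then be passed directly as the class_weight distribution, since RandomizedSearchCV accepts any object exposing an rvs method. A sketch, again reusing clf, features and labels from the question:

# RandomizedSearchCV calls .rvs() once per sampled candidate, so every
# iteration receives a freshly drawn {class: weight} dict.
parameters = {'C': scipy.stats.expon(scale=100),
              'gamma': scipy.stats.expon(scale=.1),
              'kernel': ['rbf'],
              'class_weight': ClassWeights(labels)}
search = RandomizedSearchCV(estimator=clf, param_distributions=parameters,
                            scoring='f1_micro', cv=5, n_iter=100, random_state=0)
search.fit(features, labels)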
In answer to your other two questions:
The weights can take any non-negative value (0 included) and they do not have to sum to 1. What matters is their relative, not absolute, magnitude.
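To illustrate why only the ratios matter, here is a sketch on synthetic data; the key fact is that in SVC the per-class weight rescales C for that class, so a common factor in the weights can be absorbed into C, which you are tuning anyway:

import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)

# class_weight rescales C per class (effective C = C * class_weight[class]),
# so these two fits solve exactly the same optimization problem:
a = svm.SVC(kernel='rbf', C=1.0, class_weight={0: 2.0, 1: 4.0}).fit(X, y)
b = svm.SVC(kernel='rbf', C=0.5, class_weight={0: 4.0, 1: 8.0}).fit(X, y)

print(np.allclose(a.dual_coef_, b.dual_coef_))  # True: scaling all weights is equivalent to rescaling C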