python, machine-learning, scikit-learn, training-data, multilabel-classification

adding more data to Support Vector Classifier training


I am using the LinearSVC() available in scikit-learn to classify texts into a maximum of 7 labels, so it is a multilabel classification problem. I am training on a small amount of data and testing on a held-out set. Now I want to add more data (retrieved from a pool based on a criterion) to the fitted model and evaluate on the same test set. How can this be done?

Question:

Is it necessary to merge the previous dataset with the new dataset, preprocess everything again, and then retrain to see whether performance improves with the old + new data?

My code so far is below:

import neattext as nt
import neattext.functions as nfx
from nltk.stem import PorterStemmer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(data, x, y):
  global Xfeatures
  global y_train
  global labels
  porter = PorterStemmer()
  multilabel = MultiLabelBinarizer()
  # Binarize the label column into a multilabel indicator matrix
  y_train = multilabel.fit_transform(data[y])
  print("\nLabels are now binarized\n")
  data[multilabel.classes_] = y_train
  labels = multilabel.classes_
  print(labels)
  # Inspection only: these calls return reports and their results are discarded
  data[x].apply(lambda t: nt.TextFrame(t).noise_scan())
  print("\nEnglish stop words were extracted\n")
  data[x].apply(lambda t: nt.TextExtractor(t).extract_stopwords())
  # Remove stop words, then stem each remaining token of the cleaned corpus
  corpus = data[x].apply(nfx.remove_stopwords)
  corpus = corpus.apply(lambda t: " ".join(porter.stem(w) for w in t.split()))
  tfidf = TfidfVectorizer()
  Xfeatures = tfidf.fit_transform(corpus).toarray()
  print('\nThe text is now vectorized\n')
  return Xfeatures, y_train


Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')

# Split the vectorized data: 300 samples for training, 100 for testing,
# and the remainder as a pool of extra data to add later
Xfeatures_train = Xfeatures[:300]
y_train_features = y_train[:300]
X_test = Xfeatures[300:400]
y_test = y_train[300:400]
X_pool = Xfeatures[400:]
y_pool = y_train[400:]

def model(modelo, tipo):
  # Wrap the base estimator in the given multilabel strategy,
  # fit on the global training split, and return test-set predictions
  clf = tipo(modelo)
  clf.fit(Xfeatures_train, y_train_features)
  clf_predictions = clf.predict(X_test)
  return clf_predictions

preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)
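
To judge whether added data helps, these predictions need to be scored against y_test. A minimal sketch using standard scikit-learn metrics (not part of the original code):

from sklearn.metrics import hamming_loss, accuracy_score

# Lower Hamming loss / higher subset accuracy = better multilabel predictions
print("Hamming loss:", hamming_loss(y_test, preds_pool))
print("Subset accuracy:", accuracy_score(y_test, preds_pool))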
 

Solution

  • It depends on what your previous dataset was like. If it was a good representation of the problem at hand, then adding more data will not increase your model's performance by a large margin, so you can simply test with the new data.

    However, it is also possible that your initial dataset was not representative enough, in which case more data should increase your classification accuracy. In that case it is better to include all the data and preprocess it together, because preprocessing generally involves parameters that are computed on the dataset as a whole; e.g., I can see you use TF-IDF (and any mean-based statistic behaves similarly), which is sensitive to the dataset at hand.
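
    As a concrete illustration of the merge-and-retrain path, here is a minimal sketch built on the arrays from the question. It assumes the pool was vectorized together with the rest of the data (as in the preprocess function above); if new raw text arrives later instead, the TfidfVectorizer must be refit on the merged corpus and X_test re-transformed with it. Note that LinearSVC has no partial_fit, so the model is refit from scratch:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.metrics import hamming_loss

    # Merge the original training split with the pool of new samples
    X_combined = np.vstack([Xfeatures_train, X_pool])
    y_combined = np.vstack([y_train_features, y_pool])

    # Refit on old + new data and evaluate on the unchanged test split,
    # so the comparison with the earlier model is apples to apples
    clf = OneVsRestClassifier(LinearSVC(class_weight='balanced'))
    clf.fit(X_combined, y_combined)
    preds_combined = clf.predict(X_test)
    print("Hamming loss (old + new):", hamming_loss(y_test, preds_combined))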