Tags: python, scikit-learn, text-classification, coreml, coremltools

Input parameter for model as string in Text classification


I am building a document classification system using scikit-learn, and it works fine. I am now converting the model to the Core ML model format, but the converted model expects its input parameter as a multiArrayType. I want it to accept a string or an array of strings instead, so that I can easily make predictions from an iOS application. I have tried the following:

from sklearn.linear_model import LogisticRegression

# X_train_dtm, y_train and count_vect come from earlier steps (not shown)
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

# Testing a value
docs_new = ['get exclusive prize offer']
docs_pred_class = logreg.predict(count_vect.transform(docs_new))

# Exporting to Core ML
import coremltools

coreml_model = coremltools.converters.sklearn.convert(logreg)
# Print the model description
print(coreml_model)

Printing the Core ML model gives the following output:

input {
  name: "input"
  type {
    multiArrayType {
      shape: 7505
      dataType: DOUBLE
    }
  }
}
output {
  name: "classLabel"
  type {
    int64Type {
    }
  }
}
output {
  name: "classProbability"
  type {
    dictionaryType {
      int64KeyType {
      }
    }
  }
}
predictedFeatureName: "classLabel"
predictedProbabilitiesName: "classProbability"

I looked at a Core ML model from a GitHub library, and I can see that it has different input and output types.

How can I achieve this, so that I can pass a simple parameter from an iOS app to make a prediction?


Solution

  • It sounds like the other mlmodel you found uses a DictVectorizer to turn the strings into indexes (possibly followed by a OneHotEncoder).

    You can do the same by building a pipeline in sklearn and converting that pipeline to Core ML.
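A minimal sketch of the pipeline approach, under several assumptions: the training data here is a hypothetical toy set standing in for the question's corpus, and `to_word_counts` and `export_to_coreml` are illustrative helper names, not part of sklearn or coremltools. The conversion step has not been checked against any particular coremltools or scikit-learn version. With a DictVectorizer as the first step, the converted model should take a dictionary of word counts instead of a multi-array, so the iOS app would still need to split the string into words and count them before calling the model.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the question's training set.
train_docs = [
    "get exclusive prize offer",
    "win a free prize now",
    "meeting agenda for monday",
    "please review the attached report",
]
train_labels = [1, 1, 0, 0]  # illustrative labels: 1 = spam, 0 = not spam


def to_word_counts(text):
    """Turn a string into a {word: count} dict (the app must do the same)."""
    return dict(Counter(text.lower().split()))


# The DictVectorizer maps word-count dicts to feature vectors, so the
# pipeline as a whole accepts dictionaries rather than a dense array.
pipeline = Pipeline([
    ("vectorizer", DictVectorizer()),
    ("classifier", LogisticRegression()),
])
pipeline.fit([to_word_counts(doc) for doc in train_docs], train_labels)

# Sanity check in Python before exporting.
prediction = pipeline.predict([to_word_counts("get exclusive prize offer")])


def export_to_coreml(fitted_pipeline, path):
    """Convert the fitted pipeline to Core ML and save it (needs coremltools)."""
    import coremltools  # imported lazily so the rest runs without it

    coreml_model = coremltools.converters.sklearn.convert(fitted_pipeline)
    coreml_model.save(path)
    return coreml_model
```

Calling `export_to_coreml(pipeline, "TextClassifier.mlmodel")` (the file name is a placeholder) and printing the returned model should show whether the input is now a dictionary type.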