python · coreml · coremltools

Error while creating a KNearestNeighborsClassifier with CoreMLTools 3 beta, and a question on how to set the dimensions correctly


For a project I want to create a Core ML 3 model that receives some text (e.g. from mails) and classifies it. In addition, the model should be updatable and trainable on device. I found that a KNearestNeighborsClassifier can be updatable and wanted to use it for my approach. However, first of all I got the error

" RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: Error compiling model: "Error reading protobuf spec. validator error: KNearestNeighborsClassifier requires k to be a positive integer."

while creating such a model with the script below. In addition, I am not sure how to use the KNearestNeighborsClassifier correctly for my problem. In particular, which number of dimensions is the correct one if I want to classify some texts? And how will I have to use the model correctly in the app? Maybe you know some useful guide which I have not found yet?

My script for creating the KNearestNeighborsClassifier is based on this guide: https://github.com/apple/coremltools/blob/master/examples/updatable_models/updatable_nearest_neighbor_classifier.ipynb. I have installed and am using coremltools==3.0b6.

Here my actual script for creating the model:

number_of_dimensions = 128

from coremltools.models.nearest_neighbors import KNearestNeighborsClassifierBuilder

builder = KNearestNeighborsClassifierBuilder(input_name='input',
                                             output_name='output',
                                             number_of_dimensions=number_of_dimensions,
                                             default_class_label='defaultLabel',
                                             number_of_neighbors=3,
                                             weighting_scheme='inverse_distance',
                                             index_type='linear')

builder.author = 'Christian'
builder.license = 'MIT'
builder.description = 'Classifies {} dimension vector based on 3 nearest neighbors'.format(number_of_dimensions)

builder.spec.description.input[0].shortDescription = 'Input vector to classify'
builder.spec.description.output[0].shortDescription = 'Predicted label. Defaults to \'defaultLabel\''
builder.spec.description.output[1].shortDescription = 'Probabilities / score for each possible label.'

builder.spec.description.trainingInput[0].shortDescription = 'Example input vector'
builder.spec.description.trainingInput[1].shortDescription = 'Associated true label of each example vector'


# This lets the developer of the app change the number of neighbors at runtime, anywhere between 1 and 10, with a default of 3.
builder.set_number_of_neighbors_with_bounds(3, allowed_range=(1, 10))


# Let's set the index to kd_tree with leaf size of 30
builder.set_index_type('kd_tree', 30)


# By default an empty knn model is updatable
print(builder.is_updatable)

print(builder.number_of_dimensions)
print(builder.number_of_neighbors)
print(builder.number_of_neighbors_allowed_range())
print(builder.index_type)


mlmodel_updatable_path = './UpdatableKNN.mlmodel'

# Save the updated spec
from coremltools.models import MLModel
mlmodel_updatable = MLModel(builder.spec)
mlmodel_updatable.save(mlmodel_updatable_path)

I hope that you can help me by telling me whether my overall approach of using the KNearestNeighborsClassifier for text classification makes sense, and that you can help me to create the Core ML model successfully.

Many thanks in advance.


Solution

  • Not sure why you're getting that error, although make sure you're using the latest (beta) version of coremltools (3.0b6 currently).

    As for the number of dimensions, you'll need to convert your text into a vector of a fixed length somehow. Exactly how you do that is totally up to the problem you're trying to solve.

    For example, you could use the bag-of-words technique to turn a phrase into such a vector. You can use word embeddings, or a neural network, or any of the other common techniques for this.

    But you need some way to turn the text into feature vectors; see the sketch below for one simple option.
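    As a rough illustration (not from the original answer), here is a minimal hashed bag-of-words sketch in Python. The tokenization, the use of md5 as a stable hash, and the text_to_vector helper are assumptions made for this example only; the one real requirement is that the output length matches the number_of_dimensions (128) used in the builder.

import hashlib
import numpy as np

number_of_dimensions = 128  # must match the value passed to the builder

def text_to_vector(text, dims=number_of_dimensions):
    # Hashed bag-of-words: map each token to one of `dims` buckets using a
    # stable hash and count its occurrences.
    vec = np.zeros(dims, dtype=np.float32)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec

features = text_to_vector('please review the attached invoice')
print(features.shape)  # (128,)

    A vector produced this way could then be passed to the model's 'input' feature (for example, mlmodel_updatable.predict({'input': features}) when running on macOS) or used as a training example for on-device updates. Word embeddings will usually separate classes better than a plain bag-of-words, but they need an extra embedding step before the KNN classifier.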