python json image-processing scikit-learn knn

KNeighborsClassifier predict throws "Expected 2D array, got 1D array instead"

I am writing an image similarity algorithm and use cv2.calcHist to extract image features. After the features are created, I save them to a json file as a list of numpy.float64 values: list(numpy.float64(features)). Each list is a multidimensional vector embedding.

In a second step I load the data from my json and prepare it for sklearn KNeighborsClassifier.

import numpy as np
import json
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity


with open('data.json') as f:
    jsonData = json.load(f)

X = []
y = []

for image in jsonData['images']:
    embeddingData = image['histogram']
    X.append(embeddingData)
    y.append(image['classification'])

X = np.array(X)
y = np.array(y)

#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

print('Shape of X_train:')
print(X_train.shape)
print('Shape of X_test:')
print(X_test.shape)
print('Shape of y_train:')
print(y_train.shape)

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors = 1, metric=cosine_similarity)
# Fit the classifier to the data
knn.fit(X_train, y_train)

#show predictions on the test data
y_pred = knn.predict(X_test)

When I run this code, I get the following error on the line

y_pred = knn.predict(X_test)

ValueError: Expected 2D array, got 1D array instead:
array=[1.13707140e-01 9.81128156e-01 2.89475545e-02 ... 0.00000000e+00
 5.02811105e-04 1.15502894e-01].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The output of the shape part is:

Shape of X_train:
(36, 4096)
Shape of X_test:
(9, 4096)
Shape of y_train:
(36,)

I tried to use the reshape suggestion

y_pred = knn.predict(X_test.reshape(-1, 1))

, which helped other people with the same problem like in this post, but which got me

ValueError: X has 1 features, but KNeighborsClassifier is expecting 4096 features as input.

4096 being the dimensions of my histogram features.

I tried reshaping X_train as well for it to match with X_test again:

knn.fit(X_train.reshape(-1, 1), y_train)

, but this leads to

ValueError: Found input variables with inconsistent numbers of samples: [147456, 36]

At first, I tried a slightly different approach based on a knn example where they trained their model on the iris dataset, but there knn.fit would not accept the training data with the same 2D/1D value error. Then I found this example from pyimagesearch which is pretty much what I want to do, except I have the one intermediate step with the json file. The json however is necessary in my case because I want to add other embeddings later and do not want to recalculate everything.

What I do not understand is why knn.fit accepts the data from X_train, but knn.predict does not accept the data from X_test, which were produced in the same way. Why is the error fixed for one case, but not the other?

I already tried the suggested solutions from this, this and this post, but the solution with reshape does not work in my case, as mentioned above. When I try adding extra brackets like this:

y_pred = knn.predict([X_test])

, I get the following error:

ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.

I also tried to find other examples, but found very few using similar data structures, and the ones I found did not help either.

I also found this question with the same problem, but the accepted answer is not a solution to the problem.

Here's the json file I read from.

Solution

Since the error message "Expected 2D array, got 1D array instead" is raised by knn.predict(X_test), it is natural to think that X_test has the wrong dimensions. But, as you said, X_test does have the correct dimensions, so at first sight the error does not seem to make sense.

Indeed, the error message is somewhat misleading in this particular case: the problem is hidden in the definition of knn two lines above, and in particular in its metric:

knn = KNeighborsClassifier(n_neighbors = 1, metric=cosine_similarity)

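If you change the metric to the string 'cosine' (metric='cosine'), it will work. Note that 'cosine' is the cosine distance, i.e. 1 - cosine similarity, which is what a nearest-neighbour search needs (smaller means closer).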
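Here is a minimal sketch of that change. Since I can't reproduce your json file here, it uses random arrays with the same shapes as in your output; the labels are made up as well, so treat it as an illustration rather than your exact pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumption: random stand-ins for the histogram embeddings,
# with the shapes reported in the question (36 train, 9 test, 4096 features)
rng = np.random.default_rng(0)
X_train = rng.random((36, 4096))
y_train = np.arange(36) % 3          # three made-up class labels
X_test = rng.random((9, 4096))

# Use the built-in string metric instead of the cosine_similarity function
knn = KNeighborsClassifier(n_neighbors=1, metric='cosine')
knn.fit(X_train, y_train)

# predict() now accepts the 2D test array without any reshaping
y_pred = knn.predict(X_test)
print(y_pred.shape)
```

No reshape of X_train or X_test is needed; both were already 2D.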

This is not very intuitive, but the documentation lists the strings accepted for metric. It also says that you can pass a function, as you tried to do, but that function must take two 1D arrays as inputs and return a scalar:

metric: str or callable, default=’minkowski’ Metric to use for distance computation. Default is “minkowski”, which results in the standard Euclidean distance when p = 2. See the documentation of scipy.spatial.distance and the metrics listed in distance_metrics for valid metric values. [...] If metric is a callable function, it takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors [...]
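If you do want to pass a callable, the contract quoted above could be sketched as follows. cosine_distance is a hypothetical helper written for this answer, not part of scikit-learn, and the data is random with made-up labels. I set algorithm='brute' explicitly as a precaution, since cosine distance does not satisfy the triangle inequality that tree-based searches rely on.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cosine_distance(u, v):
    """Hypothetical helper: takes two 1D vectors, returns one scalar distance."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Assumption: small random data standing in for the real embeddings
rng = np.random.default_rng(0)
X_train = rng.random((36, 64))
y_train = np.arange(36) % 3          # three made-up class labels
X_test = rng.random((9, 64))

# Callable metric: called once per pair of samples, each passed as a 1D array
knn_callable = KNeighborsClassifier(n_neighbors=1, metric=cosine_distance,
                                    algorithm='brute')
knn_callable.fit(X_train, y_train)

# Built-in string metric for comparison
knn_builtin = KNeighborsClassifier(n_neighbors=1, metric='cosine')
knn_builtin.fit(X_train, y_train)

# Both classifiers should pick the same nearest neighbour for each test sample
same = np.array_equal(knn_callable.predict(X_test), knn_builtin.predict(X_test))
print(same)
```

The callable version is evaluated pair by pair in Python, so it is much slower than the string metric; I would only use it for a distance that scikit-learn does not already provide.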

But if you look at the definition of cosine_similarity(), it says that this function takes two 2D arrays and returns one 2D array.

That's why you got the error message "Expected 2D array, got 1D array instead". The error was not directly caused by what was passed to predict(), but by what was passed to the metric function that predict() calls internally.
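You can see the mismatch directly by calling cosine_similarity() yourself. In this small sketch the vectors u and v are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 1.0, 1.0])

# 2D inputs (one sample per row) are accepted and give a 2D similarity matrix
sim = cosine_similarity(u.reshape(1, -1), v.reshape(1, -1))
print(sim.shape)

# 1D inputs, which is what a callable metric receives, are rejected
try:
    cosine_similarity(u, v)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

The ValueError printed in the second call is the same kind of message you saw coming out of knn.predict().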