python pandas numpy knn nearest-neighbor

ValueError: setting an array element with a sequence while running NearestNeighbors


I have a PySpark DataFrame like this:

+------+---------------------------------------------------------------------+
|id    |features                                                             |
+------+---------------------------------------------------------------------+
|2484  |[0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562]  |
|2504  |[0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804]|
|2904  |[0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453] |
|3084  |[0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207]     |
|3164  |[0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606]       |
|377784|[0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296]      |
|425114|[0.14556459, 0.25762737, 0.09011796, -0.27128243, 0.011280057]       |
|455074|[0.13579306, 0.3266111, 0.016416805, -0.31139722, -0.054227617]      |
|532624|[0.22281846, 0.1575731, 0.14126688, -0.29887098, -0.09433056]        |
|781654|[0.1381407, 0.14674455, 0.06877328, -0.13415968, -0.06589967]        |
+------+---------------------------------------------------------------------+

Now I need to find the nearest neighbors for these features, so here are my steps:

import numpy as np
from sklearn.neighbors import NearestNeighbors

df_collect = df.toPandas()
# convert each list in the column to a NumPy array
df_collect['features'] = df_collect['features'].apply(lambda x: np.array(x))
features = df_collect['features'].to_numpy()

knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)

The last line raises this error:

TypeError                                 Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_6511/1498389666.py in <module>
----> 1 knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/neighbors/_unsupervised.py in fit(self, X, y)
    164             The fitted nearest neighbors estimator.
    165         """
--> 166         return self._fit(X)

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
    433         else:
    434             if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
--> 435                 X = self._validate_data(X, accept_sparse="csr")
    436 
    437         self._check_algorithm_metric()

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    564             raise ValueError("Validation should be done on X, y or both.")
    565         elif not no_val_X and no_val_y:
--> 566             X = check_array(X, **check_params)
    567             out = X
    568         elif no_val_X and not no_val_y:

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744                     array = array.astype(dtype, casting="unsafe", copy=False)
    745                 else:
--> 746                     array = np.asarray(array, order=order, dtype=dtype)
    747             except ComplexWarning as complex_warning:
    748                 raise ValueError(

ValueError: setting an array element with a sequence.

I have checked the sizes of all the subarrays and they are all the same, and so is the data type. Can someone please point out what is wrong here?

Output of features:

array([array([ 0.01691085,  0.02598964,  0.00253213, -0.02223251, -0.00701562]),
       array([ 0.01501954,  0.02484422,  0.00292799, -0.02077107, -0.00611118]),
       array([ 0.01410413,  0.02474243,  0.00117077, -0.02167515, -0.00508685]),
       ...,
       array([ 0.01896316,  0.03188267,  0.00258667, -0.02800867, -0.00646481]),
       array([ 0.03538242,  0.07453772,  0.00816828, -0.02914227, -0.0942148 ]),
       array([ 0.02470775,  0.02561068,  0.00401011, -0.02863882, -0.00419102])],
      dtype=object)

Solution

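  • df.toPandas() returns a column of lists, while NearestNeighbors expects a 2D array of shape (n_samples, n_features). df_collect['features'].apply(lambda x: np.array(x)).to_numpy() gives you a 1D array with dtype=object whose elements are arrays, which is not the same as a 2D array, so the conversion to a numeric array inside scikit-learn fails. Build the 2D array from the list of lists instead: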

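    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    df_collect = df.toPandas()
    # a list of equal-length lists becomes a 2D float array
    features = np.array(df_collect.features.to_list())
    knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)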
    

    As an alternative, you can directly pass the nested list to NearestNeighbors:

    knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(df_collect.features.to_list())
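
To see the difference between the two arrays, it can help to print their shapes. The sketch below is only an illustration: it builds a small pandas DataFrame by hand from the first six rows of the question (a stand-in for df.toPandas(), so no Spark session is needed) and compares the object array with the 2D array. The names as_objects and stacked are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Stand-in for df.toPandas(): a column whose cells are Python lists
# (values copied from the first six rows of the question).
df_collect = pd.DataFrame({
    "id": [2484, 2504, 2904, 3084, 3164, 377784],
    "features": [
        [0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562],
        [0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804],
        [0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453],
        [0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207],
        [0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606],
        [0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296],
    ],
})

# Approach from the question: a 1D object array whose elements are arrays.
as_objects = df_collect["features"].apply(lambda x: np.array(x)).to_numpy()
print(as_objects.shape, as_objects.dtype)  # (6,) object

# Approach from the answer: a real 2D float array.
features = np.array(df_collect["features"].to_list())
print(features.shape, features.dtype)  # (6, 5) float64

# If you already have the object array, stacking its rows also gives a 2D array.
stacked = np.stack(list(as_objects))

knnobj = NearestNeighbors(n_neighbors=5, algorithm="auto").fit(features)
# Query the neighbors of the first row; each result has shape (1, 5).
distances, indices = knnobj.kneighbors(features[:1])
```

A shape of (6,) with dtype=object is the sign that NumPy is holding six separate array objects rather than one 6x5 numeric block, and that is what fit() cannot convert.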