Tags: python, machine-learning, logistic-regression

Logistic Regression => ValueError: setting an array element with a sequence


I am new to machine learning and am trying to apply logistic regression to a sample data set. I have a single feature that contains a list of numbers, and I want to predict a class.

The following is my code:

from sklearn.linear_model import LogisticRegression
a = [[1,2,3], [1,2,3,4,5,6], [4,5,6,7], [0,0,0,7,1,2,3]]
b = [0,1,0, 0]
p = [[9,0,2,4]]

clfModel1 = LogisticRegression(class_weight='balanced')
clfModel1.fit(a,b)
clfModel1.predict(p)

I am getting the following error:

Traceback (most recent call last):
  File "F:\python_3.4\NLP\t.py", line 7, in <module>
    clfModel1.fit(a,b)
  File "C:\Python34\lib\site-packages\sklearn\linear_model\logistic.py", line 1173, in fit
    order="C")
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
>>>

Is there some way to change the data so that I can apply the classifier and predict the results?


Solution

  • Logistic regression is an estimator for functions of the form:

    R^d -> [0,1]
    

    But your data is clearly not a subset of R^d: each sample in a has a different length (number of dimensions), so logistic regression cannot be applied to it.
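To make the mismatch concrete, here is a minimal sketch that checks the sample lengths and reproduces the failing conversion. It assumes only NumPy and the list a from the question. The exact wording of the error message may differ between NumPy versions.

```python
import numpy as np

a = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [4, 5, 6, 7], [0, 0, 0, 7, 1, 2, 3]]

# Each sample has a different number of entries, so there is no single d.
lengths = [len(row) for row in a]
print(lengths)

# Converting a ragged list to a float array fails. This is the same kind of
# conversion that scikit-learn's input validation attempts inside fit().
try:
    np.array(a, dtype=float)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("ValueError:", err)
```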

    Note that p must also be a list of samples, not a single sample, and each of those samples has to have the same d dimensions as the training data.

    There is no "way around this"; it is simply a wrong idea. Typical solutions for working with such "odd" data are:

    • you predefine your own custom mapping (a feature extraction step) that, given a variable-length point, outputs a fixed-length representation (that is, d numbers). There is no general way of doing that; everything depends on the data.
    • there are models that can deal with variable-length inputs, such as LSTMs, but it is a huge jump from logistic regression to recurrent neural nets.
    • use methods that are similarity-based (like kNN) and define your own measure of what it means for two "lists of numbers" to be similar.
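As an illustration of the feature-extraction option, the sketch below maps each list to four summary numbers (length, mean, minimum, maximum) before fitting. The summarize function and its choice of statistics are hypothetical, picked only to show the shape of the approach. Whether they carry any signal depends entirely on what the lists represent.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

a = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [4, 5, 6, 7], [0, 0, 0, 7, 1, 2, 3]]
b = [0, 1, 0, 0]
p = [[9, 0, 2, 4]]


def summarize(seq):
    """Hypothetical feature map: variable-length list -> 4 fixed numbers."""
    arr = np.asarray(seq, dtype=float)
    return [len(arr), arr.mean(), arr.min(), arr.max()]


# Every sample now has exactly d = 4 features, so the input is rectangular.
X_train = np.array([summarize(s) for s in a])
X_new = np.array([summarize(s) for s in p])

clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, b)
pred = clf.predict(X_new)
print(X_train.shape, pred)
```

The same summarize function has to be applied to anything passed to predict, so that new samples live in the same R^d as the training data.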

    There is no other way: either rethink the representation of your data, or change the approach.
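The similarity-based option can be sketched without any fixed-length representation at all. The example below is a tiny hand-written nearest-neighbour classifier in plain Python. It uses Jaccard distance on the sets of values as the similarity measure. That measure is an assumption made purely for illustration (it ignores order and repetition), and the helper names jaccard_distance and knn_predict are hypothetical.

```python
from collections import Counter


def jaccard_distance(x, y):
    """1 - |intersection| / |union| of the values in two lists."""
    sx, sy = set(x), set(y)
    union = sx | sy
    if not union:
        return 0.0
    return 1.0 - len(sx & sy) / len(union)


def knn_predict(train_X, train_y, query, k=1, distance=jaccard_distance):
    """Majority vote among the k training samples closest to query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


a = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [4, 5, 6, 7], [0, 0, 0, 7, 1, 2, 3]]
b = [0, 1, 0, 0]
p = [[9, 0, 2, 4]]

preds = [knn_predict(a, b, q, k=1) for q in p]
print(preds)
```

Swapping in a different distance function (for example, an edit distance if order matters) changes what "similar" means without touching the rest of the code.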