numpy, scikit-learn, logistic-regression

ValueError: setting an array element with a sequence (LogisticRegression with Array based feature)


I'm attempting classification via logistic regression using scikit-learn, where X consists of an Intercept column and one field, heartrate, that holds an array of heart-rate data for each row. After reading about others who have hit this error, I've made sure the heartrate arrays all have the same shape and size.

The ValueError is raised in sklearn/utils/validation.py, line 382, in check_array, on the line that copies the DataFrame into an array: array = np.array(array, dtype=dtype, order=order, copy=copy). I suspect that my arrays aren't contiguous in memory and that this is causing the problem, but I'm not sure.

Here are some code snippets to help track down the problem:

    def get_training_set(self):
        training_set = []
        after_date = datetime.utcnow() - timedelta(weeks=8)
        before_date = datetime.utcnow() - timedelta(hours=12)
        activities = self.strava_client.get_activities(after=after_date, before=before_date)
        for act in activities:
            if act.has_heartrate:
                streams = self.strava_client.get_activity_streams(activity_id=act.id, types=['heartrate'])
                heartrate = np.array(list(filter(lambda x: x is not None, streams['heartrate'].data)))
                fixed_heartrate = np.pad(heartrate, (0, 15000 - len(heartrate)), 'constant')
                item = {'activity_type': self.classes.index(act.type),'heartrate': fixed_heartrate}
                training_set.append(item)
        return pd.DataFrame(training_set)

    def train(self):
        df = self.get_training_set()
        df['Intercept'] = np.ones((len(df),))
        y = df[['activity_type']]
        X = df[['Intercept', 'heartrate']]
        y = np.ravel(y)
        #
        model = LogisticRegression()
        self.debug('y={}'.format(y))
        model = model.fit(X,y)

The exception occurs in fit...

Thanks in advance for any guidance.

Respect,

Mike

copied from comment for improved formatting:

File "/python3.5/site-packages/sklearn/linear_model/logistic.py", line 1173, in fit
    order="C")
File "/python3.5/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
File "/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence

and the other comment:

X and y look like this:

X.shape=(29, 2) 
y.shape=(29,) 
X=[[1 array([74, 74, 77, ..., 0, 0, 0])] 
   [1 array([66, 67, 69, ..., 0, 0, 0])] 
   ...          
   [1 array([92, 92, 91, ..., 0, 0, 0])] 
   [1 array([79, 79, 79, ..., 0, 0, 0])]] 
y=[ 0 11 11 0 1 0 11 0 11 1 0 11 0 0 11 0 0 0 0 0 11 0 11 0 0 0 11 0 0]
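
The printout above suggests what goes wrong: each row of X holds a scalar and a whole array, so X is a 2-D array of Python objects rather than a numeric matrix. As a minimal sketch (the short four-sample arrays and the variable names are made up for illustration, not taken from the Strava data), the same error can be reproduced with NumPy alone:

```python
import numpy as np

# Hypothetical stand-in for X: two rows, each holding [intercept, heartrate array].
X = np.empty((2, 2), dtype=object)
X[0, 0], X[0, 1] = 1, np.array([74, 74, 77, 0])
X[1, 0], X[1, 1] = 1, np.array([66, 67, 69, 0])

print(X.shape, X.dtype)  # a (2, 2) array of Python objects, not numbers

# Converting to a numeric dtype fails because each cell in the second column
# holds a whole array where a single number is expected.
error = None
try:
    np.array(X, dtype=np.float64)
except (ValueError, TypeError) as exc:
    error = exc
    print(type(exc).__name__, exc)
```

If this sketch matches the real data, memory contiguity is probably not the issue: the conversion fails on the nested arrays regardless of how they are laid out in memory.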

Solution

  • Do things work better if you change train() to look like this?

    def train(self):
        df = self.get_training_set()
        df['Intercept'] = 1                       # (a)
        y = df['activity_type'].values            # (b)
        # Build one flat feature row per activity: [intercept, hr_0, hr_1, ...]
        X = np.array([np.concatenate(([col1], col2))
                      for col1, col2 in df[['Intercept', 'heartrate']].values])
        model = LogisticRegression()
        model.fit(X, y)                           # (c)
    

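    (a) Assigning a scalar broadcasts it to every row, so there is no need to build an array of ones
    (b) Use .values to get a NumPy array instead of another DataFrame; the result is already 1-D, so np.ravel is not needed
    (c) fit updates the model in place, so the result does not need to be reassigned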
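
An alternative sketch, assuming every heartrate array has already been padded to the same length as in get_training_set: stack the arrays into a single 2-D matrix with np.stack. The sample values below are invented, and `heartrates` is a hypothetical stand-in for the heartrate column. Since LogisticRegression fits an intercept by default (fit_intercept=True), the explicit Intercept column may not be needed at all; it is shown here only to mirror the original code.

```python
import numpy as np

# Hypothetical padded heartrate arrays, one per activity (all the same length).
heartrates = [
    np.array([74, 74, 77, 0, 0]),
    np.array([66, 67, 69, 70, 0]),
    np.array([92, 92, 91, 0, 0]),
]
y = np.array([0, 11, 1])  # made-up class labels, one per activity

# Stack the per-activity arrays into one numeric matrix: (n_samples, n_features).
X = np.stack(heartrates)
print(X.shape, X.dtype)

# Optional: prepend an explicit column of ones, mirroring the Intercept field.
X_with_intercept = np.column_stack([np.ones(len(X)), X])
print(X_with_intercept.shape)
```

With the DataFrame from the question, np.stack(df['heartrate'].values) should produce the same kind of matrix, which can then be passed to model.fit(X, y).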