Tags: python, python-3.x, pandas, scikit-learn, sklearn-pandas

Fitting pandas Data Frames to Scikit-Learn’s model without using additional libraries or methods


On the one hand, people say pandas works well with scikit-learn. For example, pandas Series objects are passed directly to sklearn models in this video. On the other hand, sklearn-pandas provides a bridge between scikit-learn's machine learning methods and pandas-style DataFrames, which suggests there is a need for such libraries. Moreover, some people convert pandas DataFrames to NumPy arrays before fitting a model.

I wonder whether it's possible to combine pandas and scikit-learn without any additional methods and libraries. My problem is that whenever I fit my data set to sklearn models in the following way:

import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC

d = {'x': np.linspace(1., 100., 20), 'y': np.linspace(1., 10., 20)}
df = pd.DataFrame(d)

train, test = train_test_split(df, test_size = 0.2)

trainX = train['x']
trainY = train['y']

lin_svm = SVC(kernel='linear').fit(trainX, trainY)

I receive an error:

ValueError: Unknown label type: 19    10.000000
0      1.000000
17     9.052632
18     9.526316
12     6.684211
11     6.210526
16     8.578947
14     7.631579
10     5.736842
7      4.315789
8      4.789474
2      1.947368
13     7.157895
1      1.473684
6      3.842105
3      2.421053
Name: y, dtype: float64

As far as I understand, that's because of the data structure. However, there are a few examples on the internet that use similar code without any problems.


Solution

  • What you probably want is a regression, not a classification.

    Think about it: to do a classification, you need either a binary or a multiclass target. In your case, you are passing continuous data to your classifier as the target.

    If you trace back your error and dig a little deeper into sklearn's implementation of the .fit() method, you will find the following function:

    def check_classification_targets(y):
        """Ensure that target y is of a non-regression type.

        Only the following target types (as defined in type_of_target) are allowed:
            'binary', 'multiclass', 'multiclass-multioutput',
            'multilabel-indicator', 'multilabel-sequences'

        Parameters
        ----------
        y : array-like
        """
        y_type = type_of_target(y)
        if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
                'multilabel-indicator', 'multilabel-sequences']:
            raise ValueError("Unknown label type: %r" % y)
    

    The docstring of the function type_of_target reads:

    def type_of_target(y):
        """Determine the type of data indicated by target `y`

        Parameters
        ----------
        y : array-like

        Returns
        -------
        target_type : string
            One of:
            * 'continuous': `y` is an array-like of floats that are not all
              integers, and is 1d or a column vector.
            * 'continuous-multioutput': `y` is a 2d array of floats that are
              not all integers, and both dimensions are of size > 1.
            * 'binary': `y` contains <= 2 discrete values and is 1d or a column
              vector.
            * 'multiclass': `y` contains more than two discrete values, is not a
              sequence of sequences, and is 1d or a column vector.
            * 'multiclass-multioutput': `y` is a 2d array that contains more
              than two discrete values, is not a sequence of sequences, and both
              dimensions are of size > 1.
            * 'multilabel-indicator': `y` is a label indicator matrix, an array
              of two dimensions with at least two columns, and at most 2 unique
              values.
            * 'unknown': `y` is array-like but none of the above, such as a 3d
              array, sequence of sequences, or an array of non-sequence objects.
        """

    In your case, type_of_target(trainY) == 'continuous', so the function check_classification_targets() raises a ValueError.
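
You can check this yourself by calling type_of_target directly. The snippet below is a minimal sketch: it assumes a scikit-learn version that exposes type_of_target in sklearn.utils.multiclass, and the thresholded labels (y_binary) are a hypothetical example added only for comparison, not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Same target values as in the question: 20 evenly spaced floats from 1 to 10.
y_continuous = pd.Series(np.linspace(1.0, 10.0, 20), name="y")

# Hypothetical binary labels obtained by thresholding (illustration only).
y_binary = (y_continuous > 5.0).astype(int)

continuous_kind = type_of_target(y_continuous)
binary_kind = type_of_target(y_binary)

print(continuous_kind)  # continuous
print(binary_kind)      # binary
```

A classifier such as SVC accepts the second kind of target but rejects the first, which is what the error in the question reports.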


    Conclusion:

    • If you want to perform a classification, change your target y (e.g. use a binary vector).
    • If you want to keep your continuous data, perform a regression: use svm.SVR.
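
Both options can be sketched with plain pandas objects and no bridging library. This is an illustration under a few assumptions rather than a definitive fix: it imports train_test_split from sklearn.model_selection (the sklearn.cross_validation module used in the question was removed in later scikit-learn releases), it selects the feature with double brackets so that X is two-dimensional, and the median threshold used to build class labels is a hypothetical choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR

df = pd.DataFrame({"x": np.linspace(1.0, 100.0, 20),
                   "y": np.linspace(1.0, 10.0, 20)})

train, test = train_test_split(df, test_size=0.2, random_state=0)

# Double brackets keep X as a 2-D DataFrame of shape (n_samples, 1).
train_X, test_X = train[["x"]], test[["x"]]

# Option 1: keep the continuous target and fit a regressor.
svr = SVR(kernel="linear").fit(train_X, train["y"])
reg_pred = svr.predict(test_X)

# Option 2: turn the target into binary labels and fit a classifier.
# The median threshold is an arbitrary choice made for this example.
threshold = df["y"].median()
train_labels = (train["y"] > threshold).astype(int)
clf = SVC(kernel="linear").fit(train_X, train_labels)
clf_pred = clf.predict(test_X)

print(reg_pred.shape)  # one prediction per test row
print(clf_pred)        # array of 0/1 class labels
```

In both cases the estimators take the DataFrame and Series directly; the only changes from the question are the shape of X and the kind of target.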