machine-learning scikit-learn data-processing

How to preprocess lists into a fixed length?


I want to train a regression model with sklearn on the following data, and use it to predict the revenue from the other parameters:

[image: table of the training data]

But I ran into a problem when I tried to fit the model.

import numpy as np
from sklearn import linear_model

model = linear_model.LinearRegression()

# Each row: [staff_id (list of names), manager_id, budget]
train_x = np.array([
    [['Tom', 'Adam'], '005', 50],
    [['Tom'], '001', 100],
    [['Tom', 'Adam', 'Alex'], '001', 150]
])

# Revenue for each row
train_y = np.array([
    50,
    80,
    90
])

model.fit(train_x, train_y)


>>> ValueError: setting an array element with a sequence.

I did some searching, and the problem seems to be that the arrays in train_x (the staff_id lists) do not all have the same number of elements. I think I should add extra elements to some of the arrays to make their lengths consistent, but I have no idea how to do this step exactly. Is this what is called "vectorizing"?


Solution

  • Most machine learning models, including scikit-learn estimators, can't take such lists as inputs: they expect a numeric 2-D array in which every row has the same number of columns. Even if the lists were forced into shape, each name would be treated as a raw string (a sequence of characters), and the model probably wouldn't learn anything from it.
    Sequence-valued inputs are usually reserved for models that deal with sequential data such as time series or, in NLP, records that each contain a list of words to be processed.

    Instead of padding the arrays to the same size (as you suggested), you should "explode" your lists into separate columns.
    Create 3 more columns, one for each staff name: Tom, Adam, and Alex. A cell's value is 1 if that name appears in the row's list and 0 otherwise.

    So your table should look like this:

    -------------------------------------------------------------------
    staff_Tom | staff_Adam | staff_Alex | Manager_id | Budget | Revenue
    -------------------------------------------------------------------
         1    |      1     |      0     |      5     |   50   |    50
         1    |      0     |      0     |      1     |  100   |    80
         1    |      1     |      1     |      1     |  150   |    90
    ....
         1    |      0     |      1     |      1     |   75   |    ?
    
    

    With this layout, every staff member is a separate numeric feature, so the model can learn an individual weight for each of them and fitting no longer fails on ragged input.
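
The "explode into columns" step can be sketched as follows. This is only one possible way to do it, assuming scikit-learn's MultiLabelBinarizer is acceptable for building the 0/1 staff columns; the variable names (staff, manager_col, new_x and so on) are made up for illustration. Casting Manager_id to an integer merely mirrors the table above; since it is an identifier, it may be better to one-hot encode it as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MultiLabelBinarizer

# Raw records from the question, split by column (names are illustrative)
staff = [["Tom", "Adam"], ["Tom"], ["Tom", "Adam", "Alex"]]
manager_id = ["005", "001", "001"]
budget = [50, 100, 150]
revenue = np.array([50, 80, 90])

# One 0/1 column per staff name; the class order is fixed explicitly
# so the columns match the table: staff_Tom, staff_Adam, staff_Alex
mlb = MultiLabelBinarizer(classes=["Tom", "Adam", "Alex"])
staff_cols = mlb.fit_transform(staff)

# Assumption: Manager_id is treated as a plain number, as in the table
manager_col = np.array([int(m) for m in manager_id]).reshape(-1, 1)
budget_col = np.array(budget).reshape(-1, 1)

# Final numeric feature matrix: every row now has the same 5 columns
train_x = np.hstack([staff_cols, manager_col, budget_col])

model = LinearRegression()
model.fit(train_x, revenue)

# Encode the unseen row from the table (Tom and Alex, manager 1, budget 75)
new_staff = mlb.transform([["Tom", "Alex"]])
new_x = np.hstack([new_staff, [[1, 75]]])
prediction = model.predict(new_x)
print(train_x)
print(prediction)
```

The same fitted mlb object is reused for the new row so that its columns line up with the training columns. With only three training rows the prediction itself is not meaningful; the point is just that fit no longer raises the ValueError.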