I want to train my regression model use sklearn with the following data, and use it to predict the revenue given by other parameters:
But I met some problem when I try to fit my model.
from sklearn import linear_model
model = linear_model.LinearRegression()
train_x = np.array([
[['Tom','Adam'], '005', 50],
[['Tom'], '001', 100],
[['Tom', 'Adam', 'Alex'], '001', 150]
])
train_y = np.array([
50,
80,
90
])
model.fit(train_x,train_y)
>>> ValueError: setting an array element with a sequence.
I have done some search, The problem was that train_x did not have the same number of elements in all the arrays(staff_id). And I think maybe I should add some additional elements into some arrays to make the length consistent. But I have no idea how to do this step exactly. Does this call "vectorize"?
Machine learning models can't take such lists as inputs. It will consider your lists as a list of list of chars (because your list contains strings and each string is a sequence of chars) and probably won't learn anything.
Usually, arrays are used as inputs for models that deal with time-series data, for example in NLP, each record is a timestamp containing a list of words to be processed.
Instead of padding the arrays to be with the same size (as you suggested), you should "explode" your lists into different columns.
Create 3 more columns - one for each staff name: Tom, Adam, and Alex. The value for their cells would be 1 if the name appears in the list or 0 otherwise.
So your table should look like this:
-------------------------------------------------------------------
staff_Tom | staff_Adam | staff_Alex | Manager_id | Budget | Revenue
-------------------------------------------------------------------
1 | 1 | 0 | 5 | 50 | 50 |
1 | 0 | 0 | 1 | 100 | 80 |
1 | 1 | 1 | 1 | 150 | 90 |
....
1 | 0 | 1 | 1 | 75 | ? |
Your model will easily know and identify each staff member and will converge to a solution much quicker.