I am preparing pricing data for linear regression. My features consist only of days of the week. My target is price. I've made a list of dictionaries of my data, just like the example in sklearn 4.2.1 Loading features from dicts. So the data structure is [{'day': 'friday', 'price': 59}, {'day': 'saturday', 'price': 65}, ...] and so on.
I used sklearn's DictVectorizer per the above link to dummy code the days of the week and convert the data structure to a list of lists (suitable for sklearn LinearRegression).
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vec_fit = vec.fit_transform(my_data).toarray()
When I print vec_fit to see the data, I get the output below.
[[ 0. 0. 0. ..., 0. 1. 59.]
[ 0. 0. 0. ..., 0. 0. 92.]
[ 1. 0. 0. ..., 0. 0. 92.]
...,
[ 0. 0. 1. ..., 0. 0. 181.]
[ 0. 0. 0. ..., 0. 0. 181.]
[ 0. 1. 0. ..., 0. 0. 181.]]
Can someone explain (a) the ..., and (b) why there aren't 7 dummy variables for days of the week? In my example, the ..., seems to cover Sunday and Thursday.
To check my features (per sklearn 4.2.1), I used the get_feature_names function.
vec.get_feature_names()
[u'day=Friday', u'day=Monday', u'day=Saturday', u'day=Sunday',
u'day=Thursday', u'day=Tuesday', u'day=Wednesday', 'price']
As the output shows, all of the days seem to be appropriately represented. I am still confused about (a) and (b) above. FYI, when I run LinearRegression I only get 6 coefficients (I was expecting 7, one for each day of the week). Thanks.
They are all present, just not shown when you print vec_fit. This is numpy's default behaviour when printing large arrays: only the first 3 and last 3 columns are shown, along with the first 3 and last 3 rows.
[[ 0. 0. 0. ..., 0. 1. 59.]
[ 0. 0. 0. ..., 0. 0. 92.]
[ 1. 0. 0. ..., 0. 0. 92.]
..., <=== This is for all intermediate data values present. Just not printed
[ 0. 0. 1. ..., 0. 0. 181.]
[ 0. 0. 0. ..., 0. 0. 181.]
[ 0. 1. 0. ..., 0. 0. 181.]]
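To see this summarization in isolation, here is a minimal sketch. The array `demo` is a hypothetical stand-in for vec_fit (2000 rows and 8 columns, not your real data), and it assumes numpy's print options are still at their defaults.

```python
import numpy as np

# Hypothetical stand-in for vec_fit: 2000 rows, 8 columns (7 dummies + price).
demo = np.zeros((2000, 8))
demo[:, 7] = 59.0  # the last column plays the role of 'price'

# numpy summarizes arrays holding more than `threshold` elements, showing
# only `edgeitems` entries at each end of every axis that is long enough.
opts = np.get_printoptions()
print(opts["threshold"], opts["edgeitems"])

# The summarized text contains an ellipsis in place of the middle rows
# and the middle columns.
summary = np.array2string(demo)
print(summary)

# The columns at index 3 and 4 are hidden in the printout but still present.
hidden = demo[:, 3:5]
print(hidden.shape)
```

With 8 columns and 3 edge items on each side, exactly two columns fall behind the ellipsis, which matches the two days that appear to be missing in the question.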
You can confirm that all data exists by checking the shape of your array.
print(vec_fit.shape)
It should be (n_rows, 8). The first value (n_rows) covers all your samples. The second value (8) covers your 7 dummy variables and 1 target variable.
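As a rough illustration of that shape check, the sketch below builds a small made-up dataset with one record per weekday (the `days` list and the prices are invented for the example, not taken from your data) and runs it through DictVectorizer.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample data: one record per weekday, same layout as the question.
days = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]
my_data = [{"day": day, "price": 50 + i} for i, day in enumerate(days)]

vec = DictVectorizer()
vec_fit = vec.fit_transform(my_data).toarray()

# 7 one-hot day columns plus the numeric 'price' column.
print(vec_fit.shape)

# feature_names_ lists the column labels in the same order as the columns.
print(vec.feature_names_)
```

Because 'price' was part of each dictionary, it ends up as the eighth column of the array alongside the seven day indicators.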
If you want to print the full array, then please see these questions:
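One commonly used approach is to raise numpy's summarization threshold while printing. The sketch below assumes a numpy version that provides the np.printoptions context manager, and `demo` is again a hypothetical stand-in for vec_fit.

```python
import sys

import numpy as np

# Hypothetical stand-in for vec_fit: large enough to be summarized by default.
demo = np.zeros((200, 8))

# Lift the threshold only inside this block; the previous print settings
# are restored automatically when the block exits.
with np.printoptions(threshold=sys.maxsize):
    full_text = np.array2string(demo)

# Every row and column is now rendered, so no ellipsis remains.
print("..." in full_text)
print(full_text)
```

Calling np.set_printoptions(threshold=sys.maxsize) has the same effect but changes the setting for the rest of the session.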