Tags: python, numpy, scikit-learn, dictvectorizer

Making dummy variables for days of week using sklearn DictVectorizer


I am preparing pricing data for linear regression. My features consist only of days of the week, and my target is price. I've made a list of dictionaries from my data, just like the example in sklearn 4.2.1 Loading features from dicts. So the data structure is [{'day': 'friday', 'price': 59}, {'day': 'saturday', 'price': 65}, ...] and so on.

I used sklearn's DictVectorizer, as described in that section, to dummy-code the days of the week and convert the data structure to a list of lists (suitable for sklearn LinearRegression).

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
vec_fit = vec.fit_transform(my_data).toarray()  # sparse matrix -> dense numpy array

When I print vec_fit to see the data, I get the output below.

[[   0.    0.    0. ...,    0.    1.   59.]
 [   0.    0.    0. ...,    0.    0.   92.]
 [   1.    0.    0. ...,    0.    0.   92.]
 ...,
 [   0.    0.    1. ...,    0.    0.  181.]
 [   0.    0.    0. ...,    0.    0.  181.]
 [   0.    1.    0. ...,    0.    0.  181.]]

Can someone explain (a) the ..., and (b) why there aren't 7 dummy variables for days of the week? In my example, the ..., seems to cover Sunday and Thursday.

To check my features (per sklearn 4.2.1), I used the get_feature_names function.

vec.get_feature_names()

[u'day=Friday', u'day=Monday', u'day=Saturday', u'day=Sunday', 
 u'day=Thursday', u'day=Tuesday', u'day=Wednesday', 'price']

As the output shows, all of the days seem to be appropriately represented. I am still confused about (a) and (b) above. FYI, when I run LinearRegression I only get 6 coefficients (I am expecting 7, one for each day of the week). Thanks.


Solution

  • All seven dummy columns are there; they are just not shown when you print vec_fit. This is numpy's default behaviour when printing large arrays: only the first 3 and last 3 columns are shown, along with the first 3 and last 3 rows.

    [[   0.    0.    0. ...,    0.    1.   59.]
     [   0.    0.    0. ...,    0.    0.   92.]
     [   1.    0.    0. ...,    0.    0.   92.]
     ..., <=== This stands for all the intermediate rows. They exist, they are just not printed
     [   0.    0.    1. ...,    0.    0.  181.]
     [   0.    0.    0. ...,    0.    0.  181.]
     [   0.    1.    0. ...,    0.    0.  181.]]
    

    You can confirm that all data exists by checking the shape of your array.

    print(vec_fit.shape)
    

    It should be (n_rows, 8). The first value (n_rows) covers all your samples. The second value (8) covers your 7 dummy variables and 1 target variable.
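
To see the truncation in isolation, here is a minimal sketch that uses only numpy. The array `demo` is a made-up stand-in for vec_fit (7 one-hot day columns plus a price column, filled with random values), not the asker's data, and the variable names are hypothetical. It compares the default summarized string with the full one obtained by raising the `threshold` option; the exact ellipsis marker may differ slightly between numpy versions.

```python
import sys
import numpy as np

# Hypothetical stand-in for vec_fit: 200 rows, 7 one-hot "day" columns + 1 price column.
rng = np.random.default_rng(0)
n_rows = 200
day_index = rng.integers(0, 7, size=n_rows)            # random day for each row
dummies = np.eye(7)[day_index]                         # one-hot encode the day
prices = rng.integers(50, 200, size=n_rows).astype(float)
demo = np.column_stack([dummies, prices])

# The shape shows that every column exists, whatever the printout looks like.
print(demo.shape)

# Default formatting: arrays with more than 1000 elements are summarized,
# keeping only the first and last 3 items along each axis.
summarized = np.array2string(demo)

# Raising the threshold disables summarization for this call only.
full = np.array2string(demo, threshold=sys.maxsize)

print(summarized)
print(len(summarized), len(full))
```

The same effect can be made temporary for ordinary `print` calls with the `np.printoptions(threshold=sys.maxsize)` context manager, which leaves the global print settings untouched afterwards.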

    If you want to print the full array, then please see these questions: