Tags: pandas, scikit-learn, slice, sparse-matrix, dummy-variable

Converting categorical variables to be used in sklearn


I created a sparse matrix using the pd.get_dummies function. The matrix I have is 700M rows * 400 columns; I don't think that is very large compared to lots of problems other people are solving, but slicing it into train, val, and test sets can take forever. (I will use logistic regression and random forest to do the prediction, both of which support sparse matrices.) Is there any way to efficiently slice a sparse DataFrame, or should the whole process I am doing be improved in some way?

Given an example,

This is the list of columns I have before transforming categorical variable into dummy variables :

[u'a.exch', u'a.is_mobile', u'a.os_family', u'a.os_major', u'a.ua_family', u'a.ua_major', u'a.creative_id', u'a.creative_format', u'a.banner_position', u'a.day_hour_etc', u'b.country', u'b.connspeed', u'b.home_bus']

This is the number of unique values in each column:

a.exch 14
a.is_mobile 2
a.os_family 21
a.os_major 35
a.ua_family 49
a.ua_major 56
a.creative_id 30
a.creative_format 3
a.banner_position 6
a.day_hour_etc 4
b.country 94
b.connspeed 9
b.home_bus 3

After using pd.get_dummies, I have 300+ columns, for example:

a.exch_1, a.exch_2, ..., b.home_bus1, b.home_bus2

I set pd.get_dummies(input_df, sparse=True) because otherwise it raises a memory error. But now, with this sparse representation, everything is really slow.

Update: to split into train, val, and test, I just randomly split into 3 parts with a 6:2:2 ratio.
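For reference, this is roughly what the split step looks like, as a minimal sketch (input_df stands for the frame with the 13 categorical columns above, and the 6:2:2 split is done by shuffling row indices):

    import numpy as np
    import pandas as pd

    # one-hot encode the categorical columns as a sparse DataFrame
    dummies = pd.get_dummies(input_df, sparse=True)

    # shuffle the row indices and split 6:2:2 into train / val / test
    n = len(dummies)
    idx = np.random.permutation(n)
    train_idx = idx[:int(0.6 * n)]
    val_idx = idx[int(0.6 * n):int(0.8 * n)]
    test_idx = idx[int(0.8 * n):]

    # this row-wise slicing of the sparse DataFrame is the slow part
    train, val, test = (dummies.iloc[i] for i in (train_idx, val_idx, test_idx))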


Solution

  • A data set of 700M rows is already huge, and by using get_dummies you are making it more than 20 times wider.

    Use df[column] = pd.factorize(df[column])[0]

    or

    DictVectorizer

    I'm not sure about performance, but it will not be as bad as get_dummies, since this will not create 300+ columns. I suspect subsetting is only the start of your problems; next, training a model will run forever with this much data. (A minimal sketch of both options is below.)
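A minimal sketch of both suggestions (df is a placeholder for the original 13-column categorical frame; the two options are alternatives, not steps to run in sequence):

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    # Option 1: integer-encode each categorical column in place with factorize,
    # so the frame stays at 13 columns instead of 300+
    for col in df.columns:
        df[col] = pd.factorize(df[col])[0]

    # Option 2 (alternative, applied to the original string-valued frame):
    # one-hot encode straight into a scipy sparse matrix, bypassing get_dummies
    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(df.astype(str).to_dict(orient='records'))

Note that plain integer codes are fine for the random forest, but they impose an artificial ordering that logistic regression will treat as meaningful, so for that model the DictVectorizer (one-hot) route is the safer choice.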