Tags: r, missing-data, imputation, train-test-split

Steps to perform correct data analysis


I have a dataset with 69 columns and 50000 rows. It contains only binary and numerical variables, and some of the binary variables have missing values (about 5%).

I know I should divide the dataset into train, validation, and test sets and then perform imputation (I want to use mice with the logreg method). I have some questions about this:

  1. Should I perform imputation only on the train set, or also on the test and validation sets? If only on the train set, how do I fill the NAs in the test and validation sets?

  2. My professor told me I should reduce the dimensionality of my dataset. Can I use PCA to do this? Should I do it before or after imputation? And should I apply it only to the train set, or also to the other two sets?

  3. Also, I have tried mice, but it is incredibly slow on my dataset (it took around 50 minutes to impute half of my data). Do you know of any way to speed this up? I have read here on this forum about methods like quickpred(), but it requires specifying a minimum correlation, and I don't know what value is appropriate for my dataset (see the sketch below for roughly what I mean).
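For reference, a sketch of what I mean. Here df stands for my 50000 x 69 data frame, the 60/20/20 split and mincor = 0.2 are placeholders I made up, and the binary columns are assumed to already be two-level factors (which logreg requires):

    library(mice)

    set.seed(42)
    n       <- nrow(df)
    idx     <- sample(seq_len(n))          # shuffle row indices
    n_train <- floor(0.6 * n)
    n_valid <- floor(0.2 * n)
    train <- df[idx[seq_len(n_train)], ]
    valid <- df[idx[(n_train + 1):(n_train + n_valid)], ]
    test  <- df[idx[(n_train + n_valid + 1):n], ]

    # quickpred() keeps, for each incomplete column, only the predictors whose
    # correlation with it exceeds mincor; a smaller predictor matrix is the
    # usual way to make mice run faster.
    pred <- quickpred(train, mincor = 0.2)  # 0.2 is the value I don't know how to pick
    imp  <- mice(train, method = "logreg", predictorMatrix = pred, m = 5, seed = 42)
    train_complete <- complete(imp)         # first completed dataset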


Solution

  • Personally, this is what I would do:

    1. Yes, I would impute the values before splitting the dataset.
    2. I would reduce dimensionality after imputing the data, and I would also remove near-zero variance predictors.
    3. I would use the caret package; see its pre-processing documentation. All of this can be done in the train call with a single argument like preProcess = c("nzv", "knnImpute", "pca"); a sketch follows this list.
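A minimal sketch of that call, assuming df is the full data frame and outcome is a two-level factor target (the outcome name, the glm model, and the CV settings are all placeholders):

    library(caret)

    set.seed(42)
    fit <- train(
      outcome ~ .,                                # outcome is a placeholder name
      data       = df,
      method     = "glm",                         # placeholder model
      preProcess = c("nzv", "knnImpute", "pca"),  # drop near-zero variance
                                                  # predictors, impute, then PCA
      trControl  = trainControl(method = "cv", number = 5),
      na.action  = na.pass                        # let knnImpute handle the NAs
    )

Note that na.action = na.pass is needed here: train() fails on missing values by default, and passing them through lets the knnImpute step fill them inside the resampling loop.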