I have used MICE imputation to fill in missing values in a machine learning dataset. The dataset is huge: 11,726,412 rows and 30 columns. Here is the number of missing values per column:
In [2]: X.isnull().sum()
Out[2]:
time 0
count_neshan 0
count_scat_o 4059792
count_avl_en_o 7364664
count_avl_ex_o 7364664
count_anpr_o 9646200
karmnd_dr_mhl_shghl_o 0
veh_own_o 0
n_bussi_unit_o 0
park_area_o 0
area_o 0
office_land_use_o 0
n_office_o 0
commercial_unit_o 0
n_commercial_o 0
schl_o 0
count_scat_d 4059792
count_avl_en_d 7364664
count_avl_ex_d 7364664
count_anpr_d 9646200
karmnd_dr_mhl_shghl_d 0
veh_own_d 0
n_bussi_unit_d 0
park_area_d 0
area_d 0
office_land_use_d 0
n_office_d 0
commercial_unit_d 0
n_commercial_d 0
schl_d 0
dtype: int64
I ran this code to impute missing values in the dataset:
from impyute.imputation.cs import mice
imputed_train_data = mice(X.values)
This is the first time I am using MICE, and I have no estimate of how long it will take to run. I started this code 8 days ago and it is still running.
I could not find anything about the runtime of mice; all I know is that "it's slow". I would appreciate it if anyone with experience could estimate the runtime or suggest a faster alternative, given the size of the dataset.
According to the docs, mice runs until convergence, which is defined as less than 10% change between consecutive updates on all imputed values. This means it is unpredictable when it will stop. My intuition says that with millions of imputed values, the probability that none of them changes by more than 10% in a given iteration, i.e. that all of them satisfy the criterion at once, becomes vanishingly small.
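A back-of-the-envelope calculation shows the scale of the problem, assuming (unrealistically) that the imputed values converge independently; the per-value probability here is an arbitrary assumption:

p_single = 0.9999       # assumed chance that one imputed value changes by < 10%
n_imputed = 7_364_664   # missing count of count_avl_en_o from the question
print(p_single ** n_imputed)  # ~1e-320, effectively zero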
Seeing that the source code is actually rather simple, you could write your own version that limits the number of iterations. One comment in the source even indicates that the original implementation did this at some point:
# Step 5: Repeat step 2 - 4 until convergence (the 100 is arbitrary)
You could replace
while all(converged):
with
for _ in range(max_iterations):
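For illustration, here is a minimal sketch of what such a capped version could look like. This is not impyute's implementation, just a simple chained-regression loop; the linear models, the mean initialization, and max_iterations=10 are all assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression

def mice_capped(data, max_iterations=10):
    # MICE-style imputation with a hard iteration cap (sketch, not impyute's code)
    data = data.astype(float).copy()
    missing = np.isnan(data)
    # Step 1: initialize missing entries with column means
    col_means = np.nanmean(data, axis=0)
    for j in range(data.shape[1]):
        data[missing[:, j], j] = col_means[j]
    # Steps 2-4: regress each incomplete column on the others and re-predict
    # the originally missing entries, for a fixed number of rounds
    for _ in range(max_iterations):  # instead of "while all(converged):"
        for j in np.where(missing.any(axis=0))[0]:
            rows = ~missing[:, j]
            cols = np.delete(np.arange(data.shape[1]), j)
            model = LinearRegression().fit(data[rows][:, cols], data[rows, j])
            data[missing[:, j], j] = model.predict(data[missing[:, j]][:, cols])
    return data

imputed_train_data = mice_capped(X.values)

A ready-made alternative along the same lines is scikit-learn's IterativeImputer, which implements a MICE-style procedure with a built-in max_iter cap. Fitting it on a random subsample and then transforming the full array should keep the runtime manageable; the subsample size of 500,000 is an assumption to tune:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X.sample(n=500_000, random_state=0).values)  # fit on a subsample only
imputed_train_data = imputer.transform(X.values)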