Tags: r, time-series, missing-data, imputation, r-mice

How to choose the best imputed dataset for further analysis in R


I have a multivariate time series dataset (almost 30 years) with random missing values.

           T         S    po4      si    din
     9.00000        NA  0.290   5.310  18.51
     8.45000        NA  0.130   6.180  14.74
    13.60000  36.46000  0.010   0.500   1.86
    23.20000  32.12000  0.010   6.580   0.81
    26.00000  32.13000  0.070   0.500   0.23
          NA  35.41400  0.010   1.670   0.72
    24.80000  36.42000  0.000   3.540
    24.20000  33.16000  0.110   2.020
    22.50000  37.60000  0.040   0.400
    16.32000  36.01000  0.020   2.900
    17.60000  38.04000  0.010   0.970
     9.70000  36.36000  0.120   7.950
    13.80000  38.33000  0.010   5.760
     7.90000  35.51000  0.060   2.350
    11.90000  38.33000  0.030   3.410
    24.10000  36.30000  0.020   0.730
    25.20000  35.77000  0.020   1.370
    24.70000  37.54000  0.330   0.700
     5.75000  33.26000  0.120   0.860
    13.30000  33.14000  0.000   0.000
    13.60000  38.21265  0.000   0.190
    15.70000  28.33000  0.040  11.500  41.64

I would like to fill the gaps so that the series has a constant (monthly) frequency, which I need in order to try different time series analysis techniques. I tried the mice package in R, using with() and pool() to decide which imputed dataset to use. However, I don't want to use all of the imputed datasets in a model; I want to identify the most accurate one and use only that for further analysis. How can I do that? How can I find the best one?

Otherwise, can you suggest another method as an alternative to mice?

Thank you very much in advance


Solution

  • If your series has strong correlation in time, you can use the imputeTS package for time series imputation.

    library(imputeTS)
    # Kalman-smoothing imputation; each column is imputed separately
    na_kalman(your_dataframe)
    

    The package also includes several other methods. As for mice, the whole point of multiple imputation is to have several imputed datasets. You perform your analysis separately on each of them and then compare or pool the results. Imputation always carries some uncertainty, since the replacements for your missing data are only estimates, and this technique lets you model, or at least get a feeling for, that uncertainty.

    If you don't want to run multiple analyses and prefer single imputation, you can use any one of these datasets: they are all equally valid, and there is no best one.
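
    As a rough sketch of both routes with mice (pooling across all imputed datasets versus pulling out a single completed one): the data below are simulated, the column names temp, sal and po4 are hypothetical stand-ins for your variables, and the lm() formula is only an illustration, not a recommended model.

    ```r
    library(mice)

    set.seed(1)
    n <- 120                                  # simulated: 10 years of monthly values
    month <- rep(1:12, length.out = n)
    dat <- data.frame(
      temp = 16 + 8 * sin(2 * pi * (month - 4) / 12) + rnorm(n),  # seasonal signal
      sal  = rnorm(n, mean = 36, sd = 1.5),
      po4  = abs(rnorm(n, mean = 0.05, sd = 0.05))
    )
    # set 12 randomly chosen values per column to NA
    for (v in names(dat)) dat[sample(n, 12), v] <- NA

    # m = 5 imputed datasets via predictive mean matching
    imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

    # Route 1: fit the same model on every imputed dataset, then pool the estimates
    fit    <- with(imp, lm(po4 ~ temp + sal))
    pooled <- pool(fit)
    summary(pooled)

    # Route 2: extract one completed dataset (any index from 1 to m will do)
    dat1 <- complete(imp, 1)
    ```

    The pooled standard errors include the between-imputation variability, which is exactly the information you lose when you keep only dat1.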

    Alternatively, you could use a single-imputation package such as missForest.
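
    A minimal sketch of the missForest route, again on simulated data with hypothetical column names (temp, sal, si). missForest does not know about time ordering, so adding a month column as an extra predictor is an assumption on my part about how to let it pick up seasonality, not something the package requires.

    ```r
    library(missForest)

    set.seed(42)
    n <- 120                                  # simulated: 10 years of monthly values
    month <- rep(1:12, length.out = n)
    vals <- data.frame(
      temp = 16 + 8 * sin(2 * pi * (month - 4) / 12) + rnorm(n),
      sal  = rnorm(n, mean = 36, sd = 1.5),
      si   = abs(rnorm(n, mean = 3, sd = 2))
    )

    # prodNA() is a missForest helper that sets a share of entries to NA at random
    vals_mis <- prodNA(vals, noNA = 0.1)
    dat_mis  <- cbind(month = month, vals_mis)   # month is fully observed

    res     <- missForest(dat_mis)   # iterative random-forest imputation
    dat_imp <- res$ximp              # the single completed data frame
    res$OOBerror                     # out-of-bag estimate of the imputation error
    ```

    The out-of-bag error gives a rough idea of how well the imputation worked, but it describes the method as a whole and does not rank individual imputed values.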