Best option for missing value imputation for prcomp()

I have a data set of genotypes for approximately 200 individual genomes (columns) for nearly 1,000,000 loci (rows). Due to poor sequencing data, most rows contain 1-2 missing genotypes.

If I use

df_new = na.omit(df)

my new data frame contains only a few thousand rows, so I lose far more data than I would by imputing the one or two missing values in each row. I have been looking online for how to combine imputation with the na.action argument of prcomp(), but cannot find an example. I would like to start with the simplest approach, e.g. replacing each NA with a median value or something similar.

Could someone please direct me to an example of how to do this in the context of prcomp?


  • Now that I understand your question, see the sample below, which replaces each NA with the median of its group:

         library(plyr)
         df_imputed <- ddply(df, ~ my_groups, transform,
             value = ifelse(is.na(value),
                            median(value, na.rm = TRUE),
                            value))
      # value is the column that contains the missing values
      # my_groups is the grouping column, e.g. the first column of df
      # start from df, not df_new: na.omit() has already dropped the rows with NA

    I hope this works.
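    For the layout described in the question (loci in rows, individuals in columns), a base R version that feeds straight into prcomp() might look like the sketch below. It assumes genotypes are coded numerically (e.g. 0/1/2), which the question does not state, and the names geno, row_med and imputed are made up for illustration. Each NA is replaced by the median of its own locus (row), and the matrix is transposed so that individuals are the observations in the PCA.

```r
# Toy genotype matrix (hypothetical data): 50 loci in rows, 8 individuals in columns
set.seed(1)
geno <- matrix(sample(0:2, 50 * 8, replace = TRUE), nrow = 50,
               dimnames = list(paste0("locus", 1:50), paste0("ind", 1:8)))

# Knock out one genotype in each of the first 30 loci
geno[cbind(1:30, rep(1:8, length.out = 30))] <- NA

# Median of each locus (row), ignoring missing values
row_med <- apply(geno, 1, median, na.rm = TRUE)

# Positions of the missing values as (row, col) pairs
na_idx <- which(is.na(geno), arr.ind = TRUE)

# Fill each NA with the median of the row it sits in
imputed <- geno
imputed[na_idx] <- row_med[na_idx[, "row"]]

# prcomp() treats rows as observations, so transpose to put individuals in rows
pca <- prcomp(t(imputed), center = TRUE)
```

    pca$x then holds the scores for each individual. A locus where every genotype is missing would still be NA after this step and would have to be dropped before calling prcomp().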