Omit NA and data imputation before doing PCA analysis using R


I am trying to run a PCA with the princomp function in R.

The following is the example code:

mydf <- data.frame (
    A = c("NA", rnorm(10, 4, 5)), 
    B = c("NA", rnorm(9, 4, 5), "NA"),
    C =  c("NA", "NA", rnorm(8, 4, 5), "NA")
)

out <- princomp(mydf, cor = TRUE, na.action=na.exclude)

Error in cov.wt(z) : 'x' must contain finite values only

I tried to remove the rows containing NA from the dataset, but that does not work:

ndnew <- mydf[complete.cases(mydf),]

                   A                  B                C
1                  NA                 NA               NA
2    1.67558617743171   1.28714736288378               NA
3   -1.03388645096478    9.8370942023751 10.9522215389562
4    7.10494481721949   14.7686678743866 4.06560213642725
5     13.966212462717   3.92061729913733 7.12875100279949
6   -1.91566982754146  0.842774330179978 5.26042516598668
7  0.0974919570675357    5.5264365812476 6.30783046905425
8    12.7384749395121   4.72439301946042  2.9318845479507
9    13.1859349108349 -0.546676530952666 9.98938028956806
10   4.97278207223239   6.95942086859593 5.15901566720956
11  -4.10115142119221                 NA               NA
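
A note on why the filter above keeps all 11 rows: in the example code the missing values are written as the quoted string "NA", which R treats as ordinary text rather than as a missing value. The columns therefore stop being numeric, and complete.cases() sees nothing to drop. The following is a minimal sketch of that behaviour; the seed and the names bad, fixed, n_kept_bad and n_kept_fixed are assumptions introduced only for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Quoted "NA" is text, so c() coerces each whole column to character
# (or to factor in R versions before 4.0)
bad <- data.frame(
  A = c("NA", rnorm(10, 4, 5)),
  B = c("NA", rnorm(9, 4, 5), "NA"),
  C = c("NA", "NA", rnorm(8, 4, 5), "NA")
)

sapply(bad, is.numeric)                  # no column is numeric
n_kept_bad <- sum(complete.cases(bad))   # every row counts as complete

# Convert back to numeric; the text "NA" becomes a real missing value
fixed <- bad
fixed[] <- lapply(bad, function(col) {
  suppressWarnings(as.numeric(as.character(col)))
})
n_kept_fixed <- sum(complete.cases(fixed))  # only fully observed rows remain
```

Writing NA without quotes when building the data frame avoids the conversion step entirely.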

Even if I can remove the NAs, it might not help, as nearly every row or column has at least one missing value. Is there any R method that can impute the data before doing the PCA?


UPDATE: based on the answers:

> mydf <- data.frame (A = c(NA, rnorm(10, 4, 5)), B = c(NA, rnorm(9, 4, 5), NA),
+  C =  c(NA, NA, rnorm(8, 4, 5), NA))
> out <- princomp(mydf, cor = TRUE, na.action=na.exclude)
Error in cov.wt(z) : 'x' must contain finite values only

ndnew <- mydf[complete.cases(mydf),]
out <- princomp(ndnew, cor = TRUE, na.action=na.exclude)

This works, but the default na.action does not.

Is there any method that can impute the data? In my real data almost every column has missing values, so omitting NAs would leave me with roughly 0 rows or columns.


Solution

  • For na.action to have an effect, you need to explicitly supply a formula argument:

    princomp(formula = ~., data = mydf, cor = TRUE, na.action=na.exclude)
    
    # Call:
    # princomp(formula = ~., data = mydf, na.action = na.exclude, cor = TRUE)
    # 
    # Standard deviations:
    #    Comp.1    Comp.2    Comp.3 
    # 1.3748310 0.8887105 0.5657149 
    

    The formula is needed because it makes princomp dispatch to princomp.formula, the only princomp method that does anything useful with na.action. princomp.default has no na.action argument, as the formals of the two methods show:

    methods('princomp')
    # [1] princomp.default* princomp.formula*
    
    names(formals(stats:::princomp.formula))
    # [1] "formula"   "data"      "subset"    "na.action" "..."  
    
    names(formals(stats:::princomp.default))
    # [1] "x"      "cor"    "scores" "covmat" "subset" "..."
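
The effect of the formula interface can be checked directly, and the same data can be used to sketch the imputation the question asks about. The code below is only an illustration under stated assumptions: the seed is arbitrary, impute_mean is a hypothetical helper (not part of stats), and column-mean imputation is a crude stand-in chosen for brevity, not a method recommended by the answer above.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

mydf <- data.frame(
  A = c(NA, rnorm(10, 4, 5)),
  B = c(NA, rnorm(9, 4, 5), NA),
  C = c(NA, NA, rnorm(8, 4, 5), NA)
)

# Route 1: formula interface, so na.action is honoured.
# Incomplete rows are left out of the fit ...
fit_excl <- princomp(~ ., data = mydf, cor = TRUE, na.action = na.exclude)
fit_excl$n.obs          # number of complete rows used

# ... but na.exclude pads the scores with NA so they line up with mydf
nrow(fit_excl$scores)   # same as nrow(mydf)

# Route 2 (assumption, not from the answer): fill each missing value
# with its column mean, then run PCA on the completed data.
# impute_mean is a hypothetical helper written for this sketch.
impute_mean <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  df
}

imputed <- impute_mean(mydf)
fit_imp <- princomp(imputed, cor = TRUE)
fit_imp$n.obs           # every row is now used
```

Mean imputation keeps all rows but shrinks the variance of each filled column and weakens correlations between columns, so the resulting components should be read with caution when many values are missing.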