
How to convert all non numeric cells in data frame to NA


I am trying to convert all cells with non-numeric values to missing data (NA). I tried something similar along the lines of converting specific values to missing data, like:

recode_missing <- function (g, misval)
{
  a <- g == misval
  temp = g
  temp [a] <- NA
  return (temp)
}

That works great: an elegant R solution.

I tried things like a <- g == is.numeric () (syntactically wrong), a <- is.numeric (g) (Error: (list) object cannot be coerced to type 'double'), or even a [,] <- is.numeric (g[,]) (same error). I am aware of the solution of removing columns:

remove_nn <- function (data)
{
  # removes all non-numeric columns
  numeric_columns <- sapply (data, is.numeric)
  return (data [, numeric_columns])
} ### remove_nn ###

But that removes the columns and converts the data frame to some matrix.

Could someone please advise on how to convert single non-numeric cells to NA while leaving the data structure intact?

Edit

As the comments correctly point out, there is no such thing as an individual string value in an ocean of numeric values, just vectors that are numeric or something else. What I actually wanted to know was what caused the non-numeric error in medians <- apply (data, 2, median). I have many vectors, and inspection by eye proved useless. I ran num <- sapply (data, is.numeric) and then data [,!num], which gave me the columns that were non-numeric. In one case the cause was a single cell value containing a superfluous ". The file is preprocessed by a spreadsheet, and if just one cell is non-numeric, the complete vector is treated as non-numeric.
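The column-level check described above can be pushed down to individual cells. The following is a minimal sketch, assuming the offending columns were read in as character or factor; find_non_numeric is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: for each column, return the row positions of values
# that are not missing but cannot be converted to numeric.
find_non_numeric <- function(data) {
  bad <- lapply(data, function(x) {
    x_chr <- as.character(x)
    converted <- suppressWarnings(as.numeric(x_chr))
    which(!is.na(x_chr) & is.na(converted))
  })
  # Keep only the columns that contain at least one offending cell
  bad[lengths(bad) > 0]
}

# Made-up example data: the third value of A carries a stray double quote
df <- data.frame(
  A = c("1", "2", "3\"", "4"),
  B = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

offenders <- find_non_numeric(df)
offenders
```

Here offenders should contain a single element named A pointing at row 3, which narrows the search from a whole column to one cell.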


Solution

  • Based on your edit, you have vectors which should be numeric, but due to some erroneous data introduced during the reading-in process, the data have been converted to another format (likely character or factor).

    Here is an example of that case. mydf1 <- mydf2 <- mydf3 <- data.frame(...) just creates three data.frames with the same data.

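    # I'm going to show three approaches
    mydf1 <- mydf2 <- mydf3 <- data.frame(
      A = c(1, 2, "x", 4),
      B = c("y", 3, 4, "-"),
      # Factors were the default before R 4.0.0; set explicitly so the
      # str() output below is reproduced on newer versions too
      stringsAsFactors = TRUE
    )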
    
    str(mydf1)
    # 'data.frame': 4 obs. of  2 variables:
    #  $ A: Factor w/ 4 levels "1","2","4","x": 1 2 4 3
    #  $ B: Factor w/ 4 levels "-","3","4","y": 4 2 3 1
    

    One way to do this is to just let R coerce any values that cannot be converted to numeric to NA:

    ## You WILL get warnings
    mydf1[] <- lapply(mydf1, function(x) as.numeric(as.character(x)))
    # Warning messages:
    # 1: In FUN(X[[i]], ...) : NAs introduced by coercion
    # 2: In FUN(X[[i]], ...) : NAs introduced by coercion
    
    str(mydf1)
    # 'data.frame': 4 obs. of  2 variables:
    #  $ A: num  1 2 NA 4
    #  $ B: num  NA 3 4 NA
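The as.character step matters when the columns are factors: as.numeric applied directly to a factor returns its internal level codes, not the labels. The sketch below shows the difference and wraps the approach in a small helper. The name coerce_to_numeric is hypothetical, and silencing the coercion warnings is an assumption about what you want; drop suppressWarnings if you would rather see them.

```r
f <- factor(c("10", "20", "x"))
as.numeric(f)                                  # level codes, not 10 and 20
suppressWarnings(as.numeric(as.character(f)))  # 10 20 NA

# Hypothetical helper: convert every non-numeric column, turning anything
# that cannot be parsed as a number into NA. Numeric columns are left as-is.
coerce_to_numeric <- function(data) {
  data[] <- lapply(data, function(x) {
    if (is.numeric(x)) return(x)
    suppressWarnings(as.numeric(as.character(x)))
  })
  data
}

messy <- data.frame(
  A = c("1", "2", "x", "4"),
  B = c("y", "3", "4", "-"),
  N = 1:4
)
cleaned <- coerce_to_numeric(messy)
str(cleaned)
```

Assigning into data[] rather than data keeps the data.frame structure, which is what the question asked for.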
    

    Another option is to use makemeNA from my SOfun package:

    library(SOfun)
    makemeNA(mydf2, "[^0-9]", FALSE)
    #    A  B
    # 1  1 NA
    # 2  2  3
    # 3 NA  4
    # 4  4 NA
    
    str(.Last.value)
    # 'data.frame': 4 obs. of  2 variables:
    #  $ A: int  1 2 NA 4
    #  $ B: int  NA 3 4 NA
    

    This function is a bit different in that it uses type.convert to do the conversion, and can handle more specific rules for conversion to NA (just like you can use a vector for na.strings when reading data into R).
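If you would rather not install a package, a similar effect can be sketched with base R's type.convert and its na.strings argument. This is not how makemeNA is implemented, only an illustration of the same idea; it assumes you already know which strings should be treated as missing.

```r
df <- data.frame(
  A = c("1", "2", "x", "4"),
  B = c("y", "3", "4", "-"),
  stringsAsFactors = FALSE
)

# Treat the listed strings as missing, then let type.convert pick the
# narrowest suitable type for what remains (integer here).
df[] <- lapply(df, type.convert,
               na.strings = c("NA", "x", "y", "-"),
               as.is = TRUE)
str(df)
```

Unlike the as.numeric approach, a string that is not listed in na.strings and is not a number leaves the column as character instead of silently becoming NA, which can help surface unexpected values.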


    About your error: you most likely called as.numeric directly on your data.frame, which produces exactly the message you showed.

    Example:

    # Your error...
    as.numeric(mydf3)
    # Error: (list) object cannot be coerced to type 'double'
    

    You won't get that error on a matrix, though you will still get the coercion warning:

    # You'll get a warning
    as.numeric(as.matrix(mydf3))
    # [1]  1  2 NA  4 NA  3  4 NA
    # Warning message:
    # NAs introduced by coercion 
    

    Why don't we need to explicitly use as.character? as.matrix does that for you:

    str(as.matrix(mydf3))
    #  chr [1:4, 1:2] "1" "2" "x" "4" "y" "3" "4" "-"
    #  - attr(*, "dimnames")=List of 2
    #   ..$ : NULL
    #   ..$ : chr [1:2] "A" "B"
    

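    How can you use that information? Assign the converted vector back into mydf3[], which fills the data.frame column by column and keeps its structure: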

    mydf3[] <- as.numeric(as.matrix(mydf3))
    # Warning message:
    # NAs introduced by coercion 
    
    str(mydf3)
    # 'data.frame': 4 obs. of  2 variables:
    #  $ A: num  1 2 NA 4
    #  $ B: num  NA 3 4 NA
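With the columns numeric again, the per-column medians from the question can be computed. The sketch below assumes a data frame that has already been cleaned as shown above. Because the cleaned data now contain NA, na.rm = TRUE is needed to get a non-missing result; sapply is used instead of apply here to avoid the conversion to a matrix.

```r
# Stand-in for a data frame that has already been cleaned
cleaned <- data.frame(A = c(1, 2, NA, 4), B = c(NA, 3, 4, NA))

# Median of each column, ignoring the NA cells
medians <- sapply(cleaned, median, na.rm = TRUE)
medians
# A = 2, B = 3.5
```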