I am trying to convert all cells with non-numeric values to missing data (NA). I tried something similar along the lines of converting specific values to missing data, like:
recode_missing <- function (g, misval)
{
a <- g == misval
temp = g
temp [a] <- NA
return (temp)
}
That works great: an elegant R solution.
I tried variations such as a <- g == is.numeric () (syntactically wrong), a <- is.numeric (g) (Error: (list) object cannot be coerced to type 'double'), or even a [,] <- is.numeric (g[,]) (same). I am aware of the solution of removing columns:
remove_nn <- function (data)
{
# removes all non-numeric columns
numeric_columns <- sapply (data, is.numeric)
return (data [, numeric_columns])
} ### remove_nn ###
But that removes the columns and converts the data frame to some matrix.
Could someone please advise on how to convert single non-numeric cells to NA while leaving the data structure intact?
Edit
As the comments correctly point out, there is no such thing as an individual string value in an ocean of numeric values, just vectors which are numeric or something else. What I then wanted to know was what caused the non-numeric error in medians <- apply (data, 2, median). I have many vectors and inspection by eye proved useless. I issued num <- sapply (data, is.numeric) and next data [,!num]. That gave me the columns that were non-numeric. In one case that was caused by one cell value containing a superfluous ". The file is preprocessed by a spreadsheet and if just one cell is non-numeric, the complete vector is seen as non-numeric.
Based on your edit, you have vectors which should be numeric, but because of some erroneous data introduced while reading the data in, they have been converted to another type (likely character or factor).
Here is an example of that case. The mydf1 <- mydf2 <- mydf3 <- data.frame(...) line just creates three data.frames with the same data.
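# I'm going to show three approaches
mydf1 <- mydf2 <- mydf3 <- data.frame(
  A = c(1, 2, "x", 4),
  B = c("y", 3, 4, "-"),
  # factors were the default before R 4.0.0; set explicitly so the
  # output below is the same on newer versions
  stringsAsFactors = TRUE
)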
str(mydf1)
# 'data.frame': 4 obs. of 2 variables:
# $ A: Factor w/ 4 levels "1","2","4","x": 1 2 4 3
# $ B: Factor w/ 4 levels "-","3","4","y": 4 2 3 1
One way to do this is to just let R coerce any values that cannot be converted to numeric to NA:
## You WILL get warnings
mydf1[] <- lapply(mydf1, function(x) as.numeric(as.character(x)))
# Warning messages:
# 1: In FUN(X[[i]], ...) : NAs introduced by coercion
# 2: In FUN(X[[i]], ...) : NAs introduced by coercion
str(mydf1)
# 'data.frame': 4 obs. of 2 variables:
# $ A: num 1 2 NA 4
# $ B: num NA 3 4 NA
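If you also want to know which cells were responsible before they are turned into NA, something along these lines may help. This is only a sketch: find_bad_cells is a made-up helper name, not part of base R or SOfun, and it simply flags values that are not missing to begin with but become NA when coerced.

```r
# Hypothetical helper (name invented for illustration): for each column,
# return the values that cannot be parsed as numbers, named by row position.
find_bad_cells <- function(data) {
  lapply(data, function(x) {
    chr <- as.character(x)                    # also handles factor columns
    num <- suppressWarnings(as.numeric(chr))  # coercion warnings are expected here
    bad <- !is.na(chr) & is.na(num)           # present, but not numeric
    setNames(chr[bad], which(bad))
  })
}

example <- data.frame(
  A = c(1, 2, "x", 4),
  B = c("y", 3, 4, "-"),
  stringsAsFactors = FALSE
)
bad_cells <- find_bad_cells(example)
print(bad_cells)
```

Each list element holds the offending values of one column, which is usually quicker than inspecting a large data.frame by eye.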
Another option is to use makemeNA from my SOfun package:
library(SOfun)
makemeNA(mydf2, "[^0-9]", FALSE)
# A B
# 1 1 NA
# 2 2 3
# 3 NA 4
# 4 4 NA
str(.Last.value)
# 'data.frame': 4 obs. of 2 variables:
# $ A: int 1 2 NA 4
# $ B: int NA 3 4 NA
This function is a bit different in that it uses type.convert to do the conversion, and can handle more specific rules for conversion to NA (just like you can use a vector for na.strings when reading data into R).
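To illustrate the na.strings idea with base R alone, here is a minimal sketch that calls type.convert directly on each column. It does not reproduce how makemeNA works internally; the data and the list of strings treated as missing are assumptions chosen for this toy example.

```r
# Base R only: declare specific strings as missing, then let type.convert
# choose the column type. The na.strings values are picked for this toy data.
raw <- data.frame(
  A = c("1", "2", "x", "4"),
  B = c("y", "3", "4", "-"),
  stringsAsFactors = FALSE
)
converted <- raw
converted[] <- lapply(raw, type.convert,
                      na.strings = c("NA", "x", "y", "-"),
                      as.is = TRUE)
str(converted)
```

Unlike blanket coercion with as.numeric, this only turns the strings you list into NA, so any other unexpected value leaves the column as character and is easier to notice.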
About your error, I believe you called as.numeric on your data.frame to get the error you showed.
Example:
# Your error...
as.numeric(mydf3)
# Error: (list) object cannot be coerced to type 'double'
You won't get that error on a matrix, though (but you'll still get the warning):
# You'll get a warning
as.numeric(as.matrix(mydf3))
# [1] 1 2 NA 4 NA 3 4 NA
# Warning message:
# NAs introduced by coercion
Why don't we need to explicitly use as.character? as.matrix does that for you:
str(as.matrix(mydf3))
# chr [1:4, 1:2] "1" "2" "x" "4" "y" "3" "4" "-"
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:2] "A" "B"
How can you use that information?
mydf3[] <- as.numeric(as.matrix(mydf3))
# Warning message:
# NAs introduced by coercion
str(mydf3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: num 1 2 NA 4
# $ B: num NA 3 4 NA
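Once the columns are numeric, the medians from the question can be computed column by column. The sketch below assumes a data.frame shaped like the cleaned result above (the name cleaned is just for illustration); na.rm = TRUE is needed because median returns NA for a vector that contains missing values.

```r
# Stand-in for the cleaned data: same values as the converted example above
cleaned <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 3, 4, NA)
)

# sapply works on the columns directly, so there is no detour through a matrix
medians <- sapply(cleaned, median, na.rm = TRUE)
print(medians)
```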