r, if-statement, comparison

Efficient equivalent to ifelse but with one option in R?


# Two Ordered Vectors

sequenceA <- c(1, 2, NA, 4)
sequenceB <- c(4, 2, NA, 1) 

df <- data.frame(sequenceA, sequenceB)


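# Number of NA values that are in the same position.
# The sequences are the columns of df, so they are selected by name (df[1, ] would select a row).
sum(ifelse(!is.na(df$sequenceA), 888, 1) == ifelse(!is.na(df$sequenceB), 999, 1))
#? Number of non-NA values that are in the same position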

Let's say I have two observations in a dataframe and I want to compare how similar they are. I want to know specifically two things: how many missing values for specific variables they have in common, and how many non-missing values for specific variables they have in common.

From what I can see, none of the intersect, %in% or match functions serves this purpose, as they do not consider the order of the values, only whether they are found within the set.

I have come up with a one-line solution to check the NA values, by replacing them with a number (otherwise the comparison just returns NA). Next I want to compare only the overlap among non-NA values, so I would like to replace the NA in sequenceA with one placeholder value (e.g. "555") and the NA in sequenceB with a different placeholder value (e.g. "666").

I am looking for a one-line solution to this: if there were an equivalent to ifelse without the else, or with a do-nothing option, I could do it easily. Most similar questions get the reply to just subset the vector and re-assign (<-) a value, or to use if(){} statements, which makes the solution excessively long (especially for something I find myself wanting to do often). Am I missing a good-practice solution in R to this kind of problem?


Solution

  • To count the missing values the two observations have in common, you can use

    sum(is.na(df$sequenceA) & is.na(df$sequenceB))
    #[1] 1
    

    This can also be read as the number of NA values that are in the same position.


    To count the positions where both observations have non-missing values, use

    sum(!is.na(df$sequenceA) & !is.na(df$sequenceB))
    #[1] 3
    

    This can also be read as the number of non-NA values that are in the same position.


    To count the positions that hold the same non-missing value, we can do

    sum(df$sequenceA == df$sequenceB, na.rm = TRUE)
    #[1] 1
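On the "ifelse without the else" part of the question: base R's replace() may be the closest fit, since it changes only the selected elements and leaves everything else as it was. The following is a minimal sketch using the vectors from the question. The placeholders 555 and 666 are the ones suggested there and are assumed not to occur in the real data; compare_obs is a hypothetical helper name, not something from the original post.

```r
# Data from the question
sequenceA <- c(1, 2, NA, 4)
sequenceB <- c(4, 2, NA, 1)

# replace() touches only the selected positions, so it behaves like
# an ifelse() with no "else" branch.
# Assumption: 555 and 666 never appear as genuine values in the data.
a_filled <- replace(sequenceA, is.na(sequenceA), 555)
b_filled <- replace(sequenceB, is.na(sequenceB), 666)

# The two placeholders differ, so positions that were NA never compare equal
same_value <- sum(a_filled == b_filled)

# Hypothetical helper bundling the three counts from the answer above
compare_obs <- function(x, y) {
  c(both_na     = sum(is.na(x) & is.na(y)),
    both_non_na = sum(!is.na(x) & !is.na(y)),
    same_value  = sum(x == y, na.rm = TRUE))
}

result <- compare_obs(sequenceA, sequenceB)
print(result)
```

The placeholder route gives the same count as sum(x == y, na.rm = TRUE), but it depends on choosing values that cannot collide with real data, so the na.rm version is probably the safer default.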