
Identity and vectors in R: how does MyData1 == c("a", "b") work (or not)


I made a mistake and instead of writing MyData1 %in% c("a", "b")

...I wrote MyData1 == c("a", "b")

...but I'd like to know how and why this doesn't work. Why does the following happen?

> MyData1 <- rep(c("a", "b", "b"), 4)
> MyData1
 [1] "a" "b" "b" "a" "b" "b" "a" "b" "b" "a" "b" "b"
> MyData1 == c("a", "b")
 [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

Why is the first result TRUE and the second FALSE? Try:

> MyData1[1] == c("a", "b")
[1]  TRUE FALSE
> MyData1[2] == c("a", "b")
[1] FALSE  TRUE
> MyData1[1:2] == c("a", "b")
[1] TRUE TRUE

...I'm none the wiser...now I get two items back whether I test 1 or 2 elements of the vector!


Solution

  • This is because of vector recycling. == does an element-wise comparison, whereas %in% checks whether each value occurs anywhere in the other vector, regardless of its position.

    When one vector is shorter than the other, R recycles (repeats) the shorter one until both have the same length.

    When you do

    MyData1 == c("a", "b")
    

    the 1st value of MyData1 is compared with "a" and the 2nd with "b". Because c("a", "b") is the shorter vector, R recycles it: the 3rd value of MyData1 is compared with "a" again, the 4th with "b", and so on.
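
    The recycling described above can be spelled out by hand. This is only an illustrative sketch: the variable name recycled is made up for the example, and rep_len() from base R stands in for what == does implicitly.

    ```r
    MyData1 <- rep(c("a", "b", "b"), 4)

    # Stretch c("a", "b") to the length of MyData1, as == does implicitly.
    recycled <- rep_len(c("a", "b"), length(MyData1))
    recycled
    # [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b" "a" "b"

    # The implicit and the explicit comparison give the same result.
    identical(MyData1 == c("a", "b"), MyData1 == recycled)
    # [1] TRUE
    ```

    Printing recycled next to MyData1 shows which pairs are actually compared, which explains the TRUE/FALSE pattern in the question.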

    In the next part, recycling happens again, but this time in the opposite direction.

    MyData1[1] == c("a", "b")
    

    MyData1[1] has length 1, so its single value is compared with "a" and then repeated to be compared with "b", which gives:

    #[1]  TRUE FALSE
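
    For completeness, here is a small sketch of the remaining case from the question and of the membership test that was originally intended. The names in_result and eq_result are made up for illustration.

    ```r
    MyData1 <- rep(c("a", "b", "b"), 4)

    # Both sides have length 2, so nothing is recycled: a plain pairwise comparison.
    MyData1[1:2] == c("a", "b")
    # [1] TRUE TRUE

    # %in% returns one value per element of MyData1; position in c("a", "b") is irrelevant.
    in_result <- MyData1 %in% c("a", "b")
    all(in_result)
    # [1] TRUE

    # The same membership test written with == against single values.
    eq_result <- MyData1 == "a" | MyData1 == "b"
    identical(in_result, eq_result)
    # [1] TRUE
    ```

    In every case the result has the length of the longer operand of ==, whereas %in% always returns one value per element of its left-hand side.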