r dplyr tidyverse

Difference in Output Between Single and Double Bracket Indexing in dplyr’s case_when()


I’m working with a list in R and I’ve noticed an unexpected difference in output when using single and double bracket indexing in the case_when() function from the dplyr package. Here’s the sample list I’m working with:

list1 <- list(a = as.data.frame(cbind(1:5,6:10)), b = "Hello", c = list(x = 10, y =20))

When I use double bracket indexing in case_when(), I get one result:

case_when(list1[[1]][['V1']]==1~'a',
          list1[[1]][['V1']]==3~NA,
          list1[[1]][['V1']] %in% c(2:4)~'b')

But when I use single bracket indexing in case_when(), I get a different result:

case_when(list1[[1]]['V1']==1~'a',
          list1[[1]]['V1']==3~NA,
          list1[[1]]['V1'] %in% c(2:4)~'b')

Can anyone explain why there’s a difference in output between these two methods of indexing in case_when()? Any insights would be greatly appreciated! I think the difference is related to '%in%'.

Update: The difference is from %in%:

I have known the difference between [ and [[ for a long time, and the code above suggests the discrepancy is more likely related to %in% than to [ versus [[, because the output of

case_when(list1[[1]][['V1']]==1~'a',
          list1[[1]][['V1']]==3~NA,
          list1[[1]][['V1']] %in% c(2:4)~'b')

is:

[1] "a" "b" NA  "b" NA 

The output of

case_when(list1[[1]]['V1']==1~'a',
          list1[[1]]['V1']==3~NA,
          list1[[1]]['V1'] %in% c(2:4)~'b')

is:

[1] "a" NA  NA  NA  NA 

Therefore the difference should be related to the %in% function.

According to the documentation of %in%,

match {base}    R Documentation
Value Matching
Description
match returns a vector of the positions of (first) matches of its first argument in its second.

I passed a data frame instead of a vector in

case_when(list1[[1]]['V1']==1~'a',
          list1[[1]]['V1']==3~NA,
          list1[[1]]['V1'] %in% c(2:4)~'b')

which is why the unexpected output occurred.


Solution

  • The two indexing styles return different objects: [[ returns an atomic vector, while [ returns a data.frame. Passing those objects to the %in% operator produces different results: for the vector you get one value per element, but for the data.frame you get a single value for the whole object (one per column). In the single-bracket case that lone FALSE never matches, so case_when() falls back to its default of NA for every row not caught by an earlier condition, which is why you see more NAs. In general, you'll want to test VECTORS rather than putting other objects on the left-hand side of case_when() formulas. Below I demonstrate this.

    Demonstration

    library(dplyr)
    
    # Simpler example data
    list1 <- list(a = data.frame(V1 = 1:5))
    

    Let's use case_when() with single and double brackets just like you did. Here, we'll replace NA with NA_character_ to explicitly avoid the type conflicts that can arise in case_when().

    # Double bracket
    brack2 <- case_when(list1[[1]][['V1']] == 1 ~ 'a',
                        list1[[1]][['V1']] == 3 ~ NA_character_,
                        list1[[1]][['V1']] %in% 2:4 ~ 'b')
    
    # Single bracket
    brack1 <- case_when(list1[[1]]['V1'] == 1 ~ 'a',
                        list1[[1]]['V1'] == 3 ~ NA_character_,
                        list1[[1]]['V1'] %in% 2:4 ~ 'b')
    
    # Compare outputs - single brackets give NA where double brackets give "b"
    brack2 # double
    #> [1] "a" "b" NA  "b" NA
    brack1 # single
    #> [1] "a" NA  NA  NA  NA
    

    Why do we get different results? Let's look at what object is returned from indexing each way.

    brack2_obj <- list1[[1]][['V1']] # double
    brack1_obj <- list1[[1]]['V1'] # single
    
    brack2_obj # double object
    #> [1] 1 2 3 4 5
    brack1_obj # single object
    #>   V1
    #> 1  1
    #> 2  2
    #> 3  3
    #> 4  4
    #> 5  5
    

    These are obviously different objects! Let's see what they are...

    str(brack2_obj) # atomic vector (int)
    #>  int [1:5] 1 2 3 4 5
    str(brack1_obj) # data.frame with a column (int)
    #> 'data.frame':    5 obs. of  1 variable:
    #>  $ V1: int  1 2 3 4 5
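
The same check can be done programmatically. This is a small base R sketch of my own, not part of the original session: the objects are rebuilt so the block runs on its own, and `col_obj` is a hypothetical name. It also shows that matrix-style indexing with a single column drops to a vector by default.

```r
list1 <- list(a = data.frame(V1 = 1:5))

brack2_obj <- list1[[1]][["V1"]]  # double bracket
brack1_obj <- list1[[1]]["V1"]    # single bracket

is.atomic(brack2_obj)      # the double-bracket result is an atomic vector
is.data.frame(brack1_obj)  # the single-bracket result is still a data frame

# Row/column indexing with one column drops to a vector (drop = TRUE by default)
col_obj <- list1[[1]][, "V1"]
identical(col_obj, brack2_obj)
```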
    

    Okay, so that could be problematic. Let's run the logical tests you asked case_when() to do one at a time, on each object, to see whether the difference causes problems.

    brack2_obj == 1
    #> [1]  TRUE FALSE FALSE FALSE FALSE
    
    brack1_obj == 1
    #>         V1
    #> [1,]  TRUE
    #> [2,] FALSE
    #> [3,] FALSE
    #> [4,] FALSE
    #> [5,] FALSE
    

    That produces TRUE and FALSE where we expect and in the quantity we expect. The structures differ (a logical vector versus a one-column logical matrix), but we can come back to whether that matters if nothing else turns up.

    brack2_obj == 3
    #> [1] FALSE FALSE  TRUE FALSE FALSE
    
    brack1_obj == 3
    #>         V1
    #> [1,] FALSE
    #> [2,] FALSE
    #> [3,]  TRUE
    #> [4,] FALSE
    #> [5,] FALSE
    

    Since this isn't really any different from == 1, we expected similar results. No problems yet.

    brack2_obj %in% 2:4
    #> [1] FALSE  TRUE  TRUE  TRUE FALSE
    
    brack1_obj %in% 2:4
    #> [1] FALSE
    

    We've identified our issue! With double brackets, %in% returns a logical vector that tests each element of the atomic vector (length 5). With single brackets we only get a single FALSE, so the third condition never matches any row. Rows that match no condition are not assigned a value by case_when(), and NA is the function's default in that case.
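
Why does == give five values while %in% gives one? The following base R sketch is my own illustration (`df1`, `eq`, `in_df` and `in_vec` are hypothetical names). A data.frame is a list of columns: == is applied cell by cell and returns a logical matrix, whereas %in% is defined in terms of match(), which treats the data.frame as a list and so returns one result per column.

```r
# Hypothetical one-column data frame mirroring list1[[1]]['V1']
df1 <- data.frame(V1 = 1:5)

# A data frame is a list of columns, so its length is the number of columns
length(df1)
is.list(df1)

# == works cell by cell and returns a logical matrix, one row per observation
eq <- df1 == 1
dim(eq)

# %in% goes through match(), which sees a list with one element (the column)
in_df  <- df1 %in% 2:4          # one result for the single column
in_vec <- df1[["V1"]] %in% 2:4  # one result per element
length(in_df)
length(in_vec)
```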

    We can prove this by adding a final catch-all condition that assigns some string to anything NOT matched by the first 3 conditions in the case_when() call. Here I call it "PROOF". If that's what's going on, the default NAs will turn into "PROOF".

    case_when(list1[[1]]['V1']==1~'a',
              list1[[1]]['V1']==3~NA_character_,
              list1[[1]]['V1'] %in% 2:4 ~'b',
              TRUE ~ "PROOF")
    #> [1] "a"     "PROOF" NA      "PROOF" "PROOF"
    

    The NA remains for the value equal to 3, because we assigned it explicitly, and changes for the others that were set to NA by default. Behold, the difference has been identified!
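
The same fallback can be reproduced with plain logical vectors. This is a minimal sketch with made-up conditions (`cond_a` and `cond_b` are hypothetical names): a length-one condition is recycled across all rows, so a lone FALSE matches nothing, and unmatched rows get the default NA unless a catch-all supplies a value.

```r
library(dplyr)

cond_a <- c(TRUE, FALSE, FALSE)  # hypothetical per-element condition
cond_b <- FALSE                  # hypothetical length-one condition, like %in% on the data.frame

# cond_b is recycled to length 3 and never matches, so rows 2 and 3 fall through to NA
no_default <- case_when(cond_a ~ "a",
                        cond_b ~ "b")

# A catch-all condition replaces the default NA for unmatched rows
with_default <- case_when(cond_a ~ "a",
                          cond_b ~ "b",
                          TRUE ~ "PROOF")
no_default
with_default
```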

    This shows that you'll almost always want to pass a vector to case_when(). Here's an example of doing so:

    # Example data
    df <- data.frame(a = 1:5)
    
    # Produces NAs because we're passing a data.frame to %in%
    case_when(df == 1 ~ "foo",
              df %in% 2:3 ~ "bar",
              df > 3 ~ "pow")  
    #> [1] "foo" NA    NA    "pow" "pow"
    
    # Can fix with new default values, but this is bad coding practice since we 
    # could have other reasons for NAs!
    case_when(df == 1 ~ "foo",
              df %in% 2:3 ~ "bar",
              df > 3 ~ "pow",
              TRUE ~ "bar")  
    #> [1] "foo" "bar" "bar" "pow" "pow"
    
    # Instead, pass the vector (this is equivalent to double indexing) to test if
    # the values are in a range
    case_when(df == 1 ~ "foo",
              df$a %in% 2:3 ~ "bar",
              df > 3 ~ "pow")  
    #> [1] "foo" "bar" "bar" "pow" "pow"
    

    While I use df$a above to extract the vector, you can use any other form of indexing that accomplishes this task. Some examples:

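    # identical() compares only two objects, so check each alternative against df$a
    alternatives <- list(df[[1]], df[1][[1]], df %>% dplyr::pull(a))
    all(sapply(alternatives, identical, df$a))
    #> [1] TRUE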
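
If the data already lives in a data frame, another way to make sure case_when() only ever sees vectors is to call it inside mutate(), where bare column names refer to the column vectors. This is a sketch of that idiom rather than something from the demonstration above; the names `labelled` and `label` are made up for illustration.

```r
library(dplyr)

df <- data.frame(a = 1:5)

# Inside mutate(), `a` is the integer vector df$a, so %in% tests each element
labelled <- df %>%
  mutate(label = case_when(a == 1 ~ "foo",
                           a %in% 2:3 ~ "bar",
                           a > 3 ~ "pow"))
labelled$label
```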