
What does runif() mean when used inside if_else()?


Can you help me interpret this code? I am specifically confused by the three arguments inside if_else(): runif(n()) < 0.1, NA_character_, and as.character(cut).

diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))

source: R for Data Science


Solution

  • I'll assume you understand everything outside of the contents of the mutate call. As others have suggested in the comments, you can find documentation for any of these functions using the ?function syntax.

    dplyr::mutate() is being used here to add a new column, "cut", to the diamonds dataframe, which will replace the old "cut" column:

    cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))
    

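    if_else()

    dplyr::if_else() is a stricter version of base R's ifelse(). Both take three arguments: the first is a logical condition ("test" in ifelse(), "condition" in if_else()), the second is the value to return where the condition is true ("yes" / "true"), and the third is the value to return where the condition is false ("no" / "false"). Their main advantage over a standard 'if statement' is that they are vectorised: the condition is evaluated separately for every element. For example, using base R's ifelse():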

    ifelse(test = c(1,2,3) < 3, yes = "less than three", no = "more than two")
    # [1] "less than three" "less than three" "more than two"
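
    For comparison, here is a minimal sketch of the same call written with dplyr::if_else(), the function used in the question. It assumes the dplyr package is installed; the variable name "labels" is only illustrative.

    ```r
    library(dplyr)

    # Same logic as the ifelse() example, using if_else()'s argument names
    labels <- if_else(
      condition = c(1, 2, 3) < 3,
      true = "less than three",
      false = "more than two"
    )
    labels
    # [1] "less than three" "less than three" "more than two"
    ```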
    

    runif()

    stats::runif() draws random numbers from a uniform distribution, by default between 0 and 1. "runif" is short for "random uniform". Its first argument, "n", is the number of values to generate. For example:

    ## set random seed for reproducible results
    set.seed(1)
    ## generate 5 random numbers
    runif(5)
    # [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
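
    To see why the question's code compares these numbers with 0.1, here is a small sketch (the variable names and sample size are arbitrary): because the draws are uniform on [0, 1], each one falls below 0.1 with probability 0.1, so the comparison yields a logical vector that is TRUE for roughly 10% of its elements.

    ```r
    set.seed(1)

    # 100,000 uniform draws between 0 and 1
    draws <- runif(100000)

    # Vectorised comparison: one TRUE/FALSE per draw
    is_low <- draws < 0.1

    # Proportion of TRUE values; this should be close to 0.1
    prop_low <- mean(is_low)
    prop_low
    ```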
    

    n()

    dplyr::n() is a function that only works inside dplyr verbs such as mutate(), summarise() and filter(). It returns the number of observations in the current group. Assuming that your data is ungrouped, this is equivalent to nrow(diamonds), so runif(n()) generates one random number per row.
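
    As a rough illustration of how n() behaves with and without grouping, here is a sketch on a made-up three-row data frame ("toy" and its columns are hypothetical, not part of the question). It assumes dplyr is installed.

    ```r
    library(dplyr)

    # Hypothetical toy data, for illustration only
    toy <- tibble(group = c("a", "a", "b"), value = c(1, 2, 3))

    # Ungrouped: n() is the total number of rows (3 for every row)
    ungrouped <- toy %>% mutate(rows = n())

    # Grouped: n() is the size of each row's group (2, 2, 1)
    grouped <- toy %>%
      group_by(group) %>%
      mutate(rows = n()) %>%
      ungroup()
    ```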

    NA_character_

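    It's not obvious, but there are different types of NA value within R. A plain NA is logical and is normally coerced to the correct type, but some functions are strict about types and need the type of NA to be stated explicitly. if_else() is one of them: it expects its "true" and "false" values to have compatible types, and older versions of dplyr raise an error when a plain logical NA is paired with a character vector (newer releases are more lenient). NA_character_ just means a missing character value. Other, similar reserved names in R include NA_integer_ and NA_real_.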
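
    The different NA types can be inspected directly with typeof(). The following base R sketch (variable names are illustrative) shows the type of each constant, and that a plain NA is silently coerced when combined with a character value:

    ```r
    # Type of each NA constant
    na_types <- vapply(
      list(NA, NA_character_, NA_integer_, NA_real_),
      typeof,
      character(1)
    )
    na_types
    # [1] "logical"   "character" "integer"   "double"

    # A plain NA is coerced to character when combined with a string
    mixed <- c(NA, "a")
    typeof(mixed)
    # [1] "character"
    ```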

    as.character(cut)

    The "cut" data within the diamonds data frame is an ordered factor with five levels. The values of ordered factors are actually integers, each of which pertains to a string stored within the levels attribute of the factor. as.character is a generic function, which means it does slightly different things depending on its input. When the input of as.character is a factor, as.character returns, for each element, the name of that element's level as a character string. Converting to character here also means that both outcomes of if_else() are character values, matching NA_character_. This sounds complicated, but in practice it's very intuitive:

    my.factor <- factor(c("level 1", "level 2", "level 3", "level 2"))
    
    ## implicitly calling `print.factor`
    my.factor
    # [1] level 1 level 2 level 3 level 2
    # Levels: level 1 level 2 level 3
    
    ## peeking under the hood
    unclass(my.factor)
    # [1] 1 2 3 2
    # attr(,"levels")
    # [1] "level 1" "level 2" "level 3"
    
    ## `as.character` returns the levels pertaining to each element
    as.character(my.factor)
    # [1] "level 1" "level 2" "level 3" "level 2"
    

    Putting it all together

    The call to if_else achieves the following:

    Generate a vector of random numbers between zero and one whose length is equivalent to the number of rows in the 'diamonds' dataframe. For each of these random numbers, do the following: If the random number is less than 0.1, return a missing character value (NA_character_). Otherwise, return the level-name of the corresponding element of diamonds$cut.

    The call to mutate simply overwrites the previous diamonds$cut (used in the calculation) with this new character vector.
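
    To check this interpretation, the mutate step can be run on its own and inspected before plotting. This is a sketch that assumes dplyr and ggplot2 (which provides the diamonds data) are installed; "diamonds_na" is an illustrative name and the seed is arbitrary.

    ```r
    library(dplyr)
    library(ggplot2)  # provides the diamonds data set

    set.seed(1)

    # Replace roughly 10% of the cut values with missing values
    diamonds_na <- diamonds %>%
      mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut)))

    # cut is now a character column rather than an ordered factor
    class(diamonds_na$cut)

    # Share of missing values; this should be close to 0.1
    mean(is.na(diamonds_na$cut))
    ```

    Piping diamonds_na into ggplot() + geom_bar(mapping = aes(x = cut)) then draws the bar chart from the question, with an extra bar for the NA values.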