Tags: r, ggplot2, dplyr, truncate, quantile

R - optimal way to truncate with dplyr


I'm using R and ggplot2 to visualise variable distributions. Most of the time, because of a few extreme values, I have to truncate the variable to get a readable plot. For instance:

library(tidyverse)

data.frame(x = c(runif(500, min = 0, max = 1), 1e3)) %>%
  ggplot() + geom_density(aes(x = x))


I use the base functions quantile() and ifelse() to truncate and get a better visualisation. But I don't feel it is optimal: quantile() is repeated, meaning it's calculated twice. Does someone know a better way (without saving the quantile in a previous step)?

data.frame(x = c(runif(500, min = 0, max = 1), 1e3)) %>%
  mutate_at(vars(x), list(~ ifelse(. > quantile(., .99), quantile(., .99), .))) %>% 
  ggplot() + geom_density(aes(x = x))



Solution

    data.frame(x = c(runif(500, min = 0, max = 1), 1e3)) %>%
      mutate_at(vars(x), list(~ pmin(., quantile(., .99)))) %>% 
      ggplot() + geom_density(aes(x = x))
    

    pmin() takes element-wise ("parallel") minima, recycling the shorter argument. For example:

    x <- sample(10)
    x
    #  [1] 10  9  6  4  5  3  2  1  7  8
    pmin(x, 5)
    #  [1] 5 5 5 4 5 3 2 1 5 5
    

    And you only calculate the quantile once.
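
As a quick sanity check, the following base-R sketch compares the ifelse() form from the question with the pmin() form on the same data. The variable names via_ifelse and via_pmin are only illustrative, and the seed is arbitrary.

```r
# Compare the two-quantile ifelse() form with the single-quantile pmin() form.
set.seed(1)  # arbitrary seed, only for reproducibility
x <- c(runif(500, min = 0, max = 1), 1e3)

# Question's approach: quantile() is evaluated twice.
via_ifelse <- ifelse(x > quantile(x, .99), quantile(x, .99), x)

# Answer's approach: quantile() is evaluated once.
via_pmin <- pmin(x, quantile(x, .99))

# Both should give the same capped values.
same <- isTRUE(all.equal(as.numeric(via_ifelse), as.numeric(via_pmin)))
print(same)
```

If the two forms agree, print(same) shows TRUE, and the outlier 1e3 has been replaced by the 99th percentile in both vectors.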

    FYI, mutate_at() has been superseded by across(), introduced in dplyr 1.0.0.

    data.frame(x = c(runif(500, min = 0, max = 1), 1e3)) %>%
      mutate(across(x, ~ pmin(., quantile(., .99)))) %>% 
      ggplot() + geom_density(aes(x = x))
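
If the same capping is needed in several places, it could be wrapped in a small helper. This is only a sketch: cap_at_quantile and its arguments prob and na.rm are hypothetical names, not part of dplyr or base R.

```r
# Hypothetical helper: cap values above a given quantile (computed once).
cap_at_quantile <- function(x, prob = 0.99, na.rm = TRUE) {
  # names = FALSE drops the "99%" label from the quantile() result.
  upper <- quantile(x, probs = prob, na.rm = na.rm, names = FALSE)
  # pmin() leaves NA entries of x as NA and caps everything above `upper`.
  pmin(x, upper)
}

x <- c(1:99, 1e6)
capped <- cap_at_quantile(x)
range(capped)
```

A function like this should also be usable inside the pipeline, e.g. mutate(across(x, cap_at_quantile)), since across() accepts a plain function as well as a formula.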
    

    Note that passing a list of functions, e.g. list(~ pmin(., quantile(., .99))), is still supported by across(), but with a list the output columns are named differently. Compare:

    set.seed(42)
    x <- data.frame(x = c(runif(500, min = 0, max = 1), 1e3))
    x %>%
      mutate(across(x, list(~ pmin(., quantile(., .99))))) %>%
      head(.)
    #           x       x_1
    # 1 0.9148060 0.9148060
    # 2 0.9370754 0.9370754
    # 3 0.2861395 0.2861395
    # 4 0.8304476 0.8304476
    # 5 0.6417455 0.6417455
    # 6 0.5190959 0.5190959
    x %>%
      mutate(across(x, ~ pmin(., quantile(., .99)))) %>%
      head(.)
    #           x
    # 1 0.9148060
    # 2 0.9370754
    # 3 0.2861395
    # 4 0.8304476
    # 5 0.6417455
    # 6 0.5190959
    

    (With the list form, across() writes the result to a new column named x_1 and leaves x unchanged, so a following ggplot(aes(x = x)) would still plot the untruncated x.)
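
If the list form is preferred anyway, one way to keep the original column name is the .names argument of across(). The sketch below assumes dplyr 1.0.0 or later; df and res are illustrative names.

```r
library(dplyr)

set.seed(42)
df <- data.frame(x = c(runif(500, min = 0, max = 1), 1e3))

# .names = "{.col}" asks across() to reuse the input column name,
# so the list form overwrites x instead of creating x_1.
res <- df %>%
  mutate(across(x, list(~ pmin(., quantile(., .99))), .names = "{.col}"))

names(res)
max(res$x)
```

With this, the result has a single column x holding the capped values, so piping it on to ggplot(aes(x = x)) would plot the truncated data.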