Tags: r, dplyr, tidymodels

How to use %>% in tidymodels in R?


I am trying to split a dataset using tidymodels in R.

library(tidymodels)
data(Sacramento, package = "modeldata")
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)

I want to summarize the distribution of price in the training set, but the following code fails.

Sac_train %>% 
      select(price) %>%
      summarize(min_sell_price = min(),
                max_sell_price = max(),
                mean_sell_price = mean(),
                sd_sell_price = sd())
# Error: In min() : no non-missing arguments to min; returning Inf

However, the following code works.

Sac_train %>%
  summarize(min_sell_price = min(price),
            max_sell_price = max(price),
            mean_sell_price = mean(price),
            sd_sell_price = sd(price))

My question is: why does select(price) not work in the first example? Thanks.


Solution

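  • Even with only one column selected, the result of select(price) is still a data frame, so you still need to tell dplyr which column to summarize. The pipe passes the data frame to summarize() as its first argument; it passes nothing to min(), max(), mean(), or sd(), so those functions are called with no arguments.

    In other words, R does not treat a single-column data frame as a vector that you can pass directly to a function. For example: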

    Sac_train.vec <- 1:25
    mean(Sac_train.vec)
    # [1] 13
    

    will calculate the mean, whereas

    Sac_train.df <- data.frame(price = 1:25)
    mean(Sac_train.df)
    

    does not calculate it: mean() returns NA with a warning that the argument is not numeric or logical.

    If you do select a single column first, across() lets you write the summary more concisely:

    # Example data (dplyr provides %>%, select(), summarize(), and across())
    library(dplyr)
    Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])
    
    Sac_train %>% 
      select(price) %>%
      summarize(across(everything(), 
                       list(min = min, max = max, mean = mean, sd = sd)))
    

    Output:

    #   price_min price_max price_mean price_sd
    # 1         1        25         13 7.359801
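
To tie the points above together, here is a minimal sketch, assuming dplyr is attached and reusing the toy data frame from this answer rather than the real Sacramento data. It shows two ways to get the statistics: pull() extracts the column as a plain vector that base functions accept, while inside summarize() the column has to be named explicitly. The names prices and price_summary are just illustrative.

```r
library(dplyr)

# Toy data mirroring the example above (not the real Sacramento data)
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])

# pull() returns the column as a vector, so mean() can work on it directly
prices <- Sac_train %>% pull(price)
mean(prices)

# Inside summarize(), name the column in each call;
# the pipe only supplies the data frame to summarize() itself
price_summary <- Sac_train %>%
  summarize(min_sell_price = min(price),
            max_sell_price = max(price),
            mean_sell_price = mean(price),
            sd_sell_price = sd(price))
price_summary
```

With this approach the select(price) step is optional, since summarize() only uses the columns that are named inside it.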