Tags: r, pipe, unique, multiple-columns

Using pipes with the unique() function


Below is the code I used to do a mode imputation for the column status_group of the dataset tan1. How do I rewrite it using pipes? The unique() function does not seem to work in pipes.

NA_stat <- unique(tan1$status_group[!is.na(tan1$status_group)])

mode <- NA_stat[which.max(tabulate(match(tan1$status_group, NA_stat)))]

tan1$status_group[is.na(tan1$status_group)] <- mode  

Also, how do I apply this same process for multiple columns?


Solution

  • Here are some examples of determining and imputing the mode in a pipe.

    Functions to calculate mode:

    library(tidyverse)
    
    # Single mode (returns only the first mode if there's more than one)
    # https://stackoverflow.com/a/8189441/496488
    # Modified to remove NA
    Mode <- function(x) {
      ux <- na.omit(unique(x))
      ux[which.max(tabulate(match(x, ux)))]
    }
    
    # Return all modes if there's more than one
    # https://stackoverflow.com/a/8189441/496488
    # Modified to remove NA
    Modes <- function(x) {
      ux <- na.omit(unique(x))
      tab <- tabulate(match(x, ux))
      ux[tab == max(tab)]
    }
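
To make the difference between the two helpers concrete, here is a small check on a made-up vector with a tie and a missing value. The vector `x` is an assumption for illustration only; the helpers are repeated so the snippet runs on its own.

```r
# Helpers repeated from above so this snippet is self-contained
Mode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}
Modes <- function(x) {
  ux <- na.omit(unique(x))
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# Hypothetical data: "b" and "a" each occur twice, plus one NA
x <- c("b", "a", NA, "a", "b", "c")

# Mode() keeps only the tied value that appears first in x
Mode(x)
#> [1] "b"

# Modes() keeps every tied value; the NA is never counted
Modes(x)
#> [1] "b" "a"
```

Because ties are broken by order of appearance, the value that Mode returns can change if the rows are reordered.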
    

    Apply the functions to a data frame:

    iris %>% 
      summarise(across(everything(), Mode))
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #> 1            5           3          1.4         0.2  setosa
    
    iris %>% map(Modes)
    #> $Sepal.Length
    #> [1] 5
    #> 
    #> $Sepal.Width
    #> [1] 3
    #> 
    #> $Petal.Length
    #> [1] 1.4 1.5
    #> 
    #> $Petal.Width
    #> [1] 0.2
    #> 
    #> $Species
    #> [1] setosa     versicolor virginica 
    #> Levels: setosa versicolor virginica
    

    Impute missing data using the mode. Note that this uses Mode, which returns only the first mode when several values tie, so you may need to adjust the method if your data has multiple modes.

    # Create missing data: set every column of the first row to NA
    d <- iris
    d[1, ] <- rep(NA, ncol(iris))
    
    head(d)
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #> 1           NA          NA           NA          NA    <NA>
    #> 2          4.9         3.0          1.4         0.2  setosa
    #> 3          4.7         3.2          1.3         0.2  setosa
    #> 4          4.6         3.1          1.5         0.2  setosa
    #> 5          5.0         3.6          1.4         0.2  setosa
    #> 6          5.4         3.9          1.7         0.4  setosa
    
    # Replace missing values in each column with that column's mode
    d <- d %>% 
      mutate(across(everything(), ~coalesce(., Mode(.))))
    
    head(d)
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    #> 1          5.0         3.0          1.5         0.2 versicolor
    #> 2          4.9         3.0          1.4         0.2     setosa
    #> 3          4.7         3.2          1.3         0.2     setosa
    #> 4          4.6         3.1          1.5         0.2     setosa
    #> 5          5.0         3.6          1.4         0.2     setosa
    #> 6          5.4         3.9          1.7         0.4     setosa
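
Applied to the question's setting, the same pattern might look like the sketch below. The real tan1 is not shown in the question, so the data frame here is a hypothetical stand-in: only the column name status_group comes from the question, while other_col, id, and all values are invented for illustration.

```r
library(dplyr)

# Helper repeated from above so this snippet is self-contained
Mode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical stand-in for tan1 (only status_group is named in the question)
tan1 <- data.frame(
  status_group = c("functional", NA, "functional", "non functional", NA),
  other_col    = c("x", "y", NA, "y", "y"),
  id           = 1:5,
  stringsAsFactors = FALSE
)

# unique() in a pipe: pull the column out as a vector first.
# Piping the whole data frame into unique() would de-duplicate rows instead.
tan1 %>% pull(status_group) %>% unique()

# Mode imputation for a single column
tan1_one <- tan1 %>%
  mutate(status_group = coalesce(status_group, Mode(status_group)))

# The same imputation for several named columns at once
tan1_many <- tan1 %>%
  mutate(across(c(status_group, other_col), ~ coalesce(.x, Mode(.x))))
```

Listing the columns inside across() leaves the others, such as id here, untouched. A selector like where(is.character) could replace the explicit list if every column of one type should be imputed.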