Search code examples
rdplyrr-recipes

R package or function to record filters applied to your tibble


Does there exist any R function or packages that records the operations applied to a tibble/data frame?

For example, if I did the following

data(iris)
my_table <- iris %>% filter(Sepal.Length>6) %>% filter(Species == 'virginica')

I would want the output to be something of the form

display_filter_function(my_table)

output:
Step   filter
1       sepal.length > 6
2       Species == 'virginica'

I am thinking that this would be something similar to the functionality provided by the recipes package, but not needing to use the step_ family of function


Solution

  • I've written a little module for you. It is a standalone resource and has only one dependency beyond base R: namely dplyr itself. The module is long, so I have put it at the bottom of this post. You can find the code itself under the Module section, and its usage is demonstrated under the Usage section.

    This model could theoretically be extended to all dplyr functions, and to other generic functions as well. To keep things manageable, I myself have implemented it for dplyr::filter() alone.

    Background

    This module leverages the R concept of generic methods, like print() and format() and mean() and summary(). Suppose you wish to print() a data.frame object. The generic print() function...

    print
    #> function (x, ...) 
    #> UseMethod("print")
    #> <bytecode: 0x000002429186e2c8>
    #> <environment: namespace:base>
    

    ...does not do the work itself! Rather, it dispatches to some print.*() method, via the line:

    UseMethod("print")
    

    Now the native data.frame class has its own special print() method called print.data.frame().

    print.data.frame
    #> function (x, ..., digits = NULL, quote = FALSE, right = TRUE, row.names = TRUE, max = NULL) 
    #> {
    #>     n <- length(row.names(x))
    #>     ⋮
    #>     invisible(x)
    #> }
    #> <bytecode: 0x000002429186b7e0>
    #> <environment: namespace:base>
    

    So when UseMethod() seeks a matching ("print") method, it finds print.data.frame() ready and waiting! It is the print.data.frame() function that actually handles the printing for the data.frame.

    More generally, a generic function like fn()...

    fn <- function(x, ...) {
      UseMethod("fn")
    }
    

    can be implemented for a (S3) class like cls, with a function of the form fn.cls():

    fn.cls <- function(x, arg_1, arg_2, arg_3, ...) {
      # ...
    } 
    

    Note

    The fn.default() method handles fn() for unimplemented classes. So in the absence of a print.cls() function, then UseMethod() would dispatch a cls object to print.default():

    print.default
    #> function (x, digits = NULL, quote = TRUE, na.print = NULL, print.gap = NULL, right = FALSE, max = NULL, width = NULL, useSource = TRUE, ...) 
    #> {
    #>     args <- pairlist(digits = digits, quote = quote, na.print = na.print, ...
    #>     ⋮
    #>     .Internal(print.default(x, args, missings))
    #> }
    #> <bytecode: 0x0000024291917b80>
    #> <environment: namespace:base>
    

    Approach

    By defining a custom S3 class called hst_obj — "historical object" — I override the "generic" behavior of dplyr::filter()...

    dplyr::filter
    #> function (.data, ..., .preserve = FALSE) 
    #> {
    #>     UseMethod("filter")
    #> }
    #> <bytecode: 0x0000024292d10b40>
    #> <environment: namespace:dplyr>
    

    ...which is designed to dispatch via UseMethod("filter"). To that end, I implement the function filter.hst_obj():

    filter.hst_obj
    #> function (.data, ..., .preserve = FALSE) 
    #> {
    #>     .update_hst(x = `class<-`(dplyr::filter(.data = un_hst_obj(.data, ...
    #> }
    #> <bytecode: 0x000002428f842958>
    

    When you call dplyr::filter() on a hst_obj object, then filter.hst_obj() jumps into action! Whenever it filters the object, it also records the filtration criteria in the special attribute obj_hst, which maintains the "object history".

    This history is a tibble...

    # A tibble: m × 4
       step order expr       text            
      <int> <int> <list>     <chr>           
    1     1     1 <language> sepal.length > 6
    ⋮      ⋮     ⋮      ⋮             ⋮
    

    ...which has four columns:

    • step: The filter() step in the workflow.
    iris %>%                              # step
      filter(Sepal.Length > 6) %>%        # } 1
      filter(Species == 'virginica') %>%  # } 2
      ...                                 #   ⋮
    
    • order: The criterion within the filter() step.
         filter(a < 10, b == 3 | c > 5, ...)
    #           |----|  |------------|
    # order:       1           2        ...
    
    • expr: The actual code (language) for the criterion (Sepal.Length > 6), useful for programmatic manipulation of R.
    • text: A textual (character) representation of that code ("Sepal.Length > 6"), for visual clarity.

    Usage

    You'll want to load dplyr itself, and then source() the module (mod.R) from (say) your working directory.

    # Load the `dplyr` package...
    library(dplyr)
    
    # ...along with the `hst_obj` functions from the module:
    source("./mod.R")
    

    Warning

    The modular function filter.hst_obj() must be loaded into the same workspace where you use dplyr::filter(). Per the documentation

    UseMethod...search[es] for methods in two places: in the environment in which the generic function is called, and in the registration data base for the environment in which the generic is defined (typically a namespace). So methods for a generic function need to be available in the environment of the call to the generic, or they must be registered.


    Here is a simple workflow on the iris dataset.

    iris %>%
      filter(Sepal.Length > 7, Sepal.Width <= 3) %>%
      filter(Petal.Width > 2)
    #> Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    #> 1          7.1         3.0          5.9         2.1 virginica
    #> 2          7.6         3.0          6.6         2.1 virginica
    #> 3          7.7         2.6          6.9         2.3 virginica
    #> 4          7.7         3.0          6.1         2.3 virginica
    

    Now we transform the dataset into a "historical object" called iris_hst, via as_hst_obj().

    iris_hst <- as_hst_obj(iris)
    

    Per is_hst_obj(), it is indeed a historical object.

    iris_hst %>% is_hst_obj()
    #> TRUE
    

    However, its history via get_hst() is still blank.

    iris_hst %>% get_hst()
    #> # A tibble: 0 × 4
    #> # … with 4 variables: step <int>, order <int>, expr <list>, text <chr>
    

    We now perform the same workflow on the historical dataset iris_hst...

    iris_hst <- iris_hst %>%
      filter(Sepal.Length > 7, Sepal.Width <= 3) %>%
      filter(Petal.Width > 2)
    

    ...which yields a consistent output.

    iris_hst
    #> Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    #> 1          7.1         3.0          5.9         2.1 virginica
    #> 2          7.6         3.0          6.6         2.1 virginica
    #> 3          7.7         2.6          6.9         2.3 virginica
    #> 4          7.7         3.0          6.1         2.3 virginica
    

    Crucially, we can now access the history via get_hst():

    iris_hst %>% get_hst()
    #> # A tibble: 3 × 4
    #>    step order expr       text            
    #>   <int> <int> <list>     <chr>           
    #> 1     1     1 <language> Sepal.Length > 7
    #> 2     1     2 <language> Sepal.Width <= 3
    #> 3     2     1 <language> Petal.Width > 2 
    

    We can also "reset" the history via reset_hst(), which clears the tibble of historical data.

    iris_hst <- iris_hst %>% reset_hst()
    
    iris_hst %>% get_hst()
    #> # A tibble: 0 × 4
    #> # … with 4 variables: step <int>, order <int>, expr <list>, text <chr>
    

    Finally, we can revert to an "unhistorical" object via un_hst_obj(), which removes the hst_obj classification and deletes the obj_hst attribute:

    iris_unhst <- iris_hst %>% un_hst_obj()
    
    # It is no longer a "historical" object...
    iris_unhst %>% is_hst_obj()
    #> FALSE
    
    # ...and the history is nonexistent (not merely blank) entirely.
    iris_unhst %>% get_hst()
    #> 
    

    Module

    Here is the module. I recommend saving it locally, as (say) mod.R in (say) your working directory. I also recommend the box package, which can load such modules painlessly via box::use(./mod).

    #########
    ## API ##
    #########
    
    # Test if an object is "historical object" whose filtrations are recorded.
    is_hst_obj <- function(x) {
      inherits(x, .HST_OBJ_CLASS)
    }
    
    
    # Treat an object as "historical".
    as_hst_obj <- function(x) {
      if (!is_hst_obj(x)) {
        class(x) <- c(.HST_OBJ_CLASS, class(x))
      }
      
      x
    }
    
    
    # Erase the "historicity" of an object.
    un_hst_obj <- function(x, hst = TRUE) {
      if (is_hst_obj(x)) {
        org_class <- class(x)
        class(x) <- org_class[org_class != .HST_OBJ_CLASS]
        
        if (isTRUE(hst)) {
          x <- .set_hst(x, hst = NULL)
        }
      }
      
      x
    }
    
    
    # Get the history from a historical object.
    get_hst <- function(x) {
      hst <- attr(x, .OBJ_HST_ATTR)
      
      if (is.null(hst)) {
        if (is_hst_obj(x)) {
          .BLANK_OBJ_HST
          # NULL
        } else {
          invisible(NULL)
        }
      } else {
        hst
      }
    }
    
    
    # Reset the history for a historical object.
    reset_hst <- function(x) {
      if (is_hst_obj(x)) {
        x <- .set_hst(x, hst = NULL)
      }
      
      x
    }
    
    
    
    ##############
    ## Dispatch ##
    ##############
    
    # Dispatch filtration for historical objects.
    filter.hst_obj <- evalq(envir = new.env(), {
      # Define the filtration function: `dplyr::filter()`
      fn_expr <- quote(dplyr::filter)
      #                ^^^^^^^^^^^^^
      #                 UPDATE HERE
      fn <- eval(fn_expr)
      
      # Replicate in our result the signature of that original function.
      arg_syms <- as.list(args(fn))
      arg_syms <- utils::head(arg_syms, n = -1)
      arg_syms <- sapply(names(arg_syms), as.symbol, USE.NAMES = TRUE)
      
      
      # Prepare the elements for the function body...
      obj_sym <- arg_syms[[1]]   # The (1st) argument (.data) for the object...
      cnd_exprs <- arg_syms$...  # ...and dots (...) for filtration condition(s).
      
      # ...including a similar call to the filter with an "ahistorical" object...
      arg_syms[[as.character(obj_sym)]] <- substitute(un_hst_obj(
        obj_sym,
        hst = FALSE
      ))
      fn_call <- as.call(c(list(fn_expr), arg_syms))
      
      sub_list <- list(
        obj = obj_sym,
        cnd = cnd_exprs,
        cll = fn_call
      )
      
      # ...and assemble those elements.
      fn_body <- substitute(env = sub_list, quote({
        .update_hst(
          # Perform the unclassed call and then restore any "historicity"...
          x = `class<-`(cll, class(obj)),
          # ...and then update the history with the filtration criteria.
          exprs = match.call(expand.dots = FALSE)$cnd
        )
      }))
      
      
      # Pair this body with the header from `dplyr::filter()`...
      fn_body <- eval(fn_body)
      body(fn) <- fn_body
      
      # ...and transfer the resulting function to the calling environment.
      environment(fn) <- parent.frame(n = 2)
      
      
      # Return the resulting function.
      fn
    })
    
    
    
    #############
    ## Support ##
    #############
    
    # Labels for the object class...
    .HST_OBJ_CLASS <- "hst_obj"
    
    # ...and its history attribute.
    .OBJ_HST_ATTR <- "obj_hst"
    
    # The default history for an object.
    .BLANK_OBJ_HST <- dplyr::tibble(
      step = integer(),
      order = integer(),
      expr = list(),
      text = character()
    )
    
    
    # Set the history for a historical object.
    .set_hst <- function(x, hst) {
      attr(x, .OBJ_HST_ATTR) <- hst
      x
    }
    
    # Update the history with a list of filtration expressions.
    .update_hst <- function(x, exprs) {
      # Augment the history of a "historical" object.
      if (is_hst_obj(x)) {
        # Get the current history.
        hst <- get_hst(x)
        
        # # ...and default if the history is missing.
        # if (is.null(hst)) {
        #   hst <- .BLANK_OBJ_HST
        # }
        
        
        # Augment the history: format the new additions...
        next_cnd <- exprs
        # next_cnd <- sapply(next_cnd, as.expression, simplify = FALSE)
        next_txt <- sapply(next_cnd, deparse, simplify = TRUE)
        next_ord <- seq_along(next_cnd)
        
        if (length(exprs) == 0) {
          next_stp <- integer()
        } else if (nrow(hst) == 0) {
          next_stp <- 1
        } else {
          next_stp <- max(hst$step) + 1
        }
        
        next_hst <- dplyr::tibble(
          step = as.integer(next_stp),
          order = as.integer(next_ord),
          expr = as.list(next_cnd),
          text = as.character(next_txt)
        )
        
        # ...and append them to the existing history.
        hst <- dplyr::bind_rows(hst, next_hst)
        
        
        # Update the history.
        x <- .set_hst(x, hst = hst)
      }
      
      # Return the updated object.
      x
    }