Search code examples
rdplyrtidyevalnse

R dplyr programatically identify column


For some objects an attribute identifies a special column, for example the geometry column in an sf object. For conducting some calculations in dplyr it would be good to easily identify these columns. I'm searching for a way to create a function that helps identifying this column. In the example below I can make a function that identifies this column but I still need to use the rlang splice operator (!!!).

require(sf)
require(dplyr)
n<-4
df = st_as_sf(data.frame(x = 1:n, y = 1:n, cat=gl(2,2)), coords = 1:2, crs = 3857) %>% group_by(cat)
# this is the example I start from however the geometry column is not guaranteed to have that name
df %>% mutate(d=st_distance(geometry, geometry[row_number()==1]))
#> Simple feature collection with 4 features and 2 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 1 ymin: 1 xmax: 4 ymax: 4
#> Projected CRS: WGS 84 / Pseudo-Mercator
#> # A tibble: 4 × 3
#> # Groups:   cat [2]
#>   cat      geometry d[,1]
#> * <fct> <POINT [m]>   [m]
#> 1 1           (1 1)  0   
#> 2 1           (2 2)  1.41
#> 3 2           (3 3)  0   
#> 4 2           (4 4)  1.41
# this works, however the code does not get easier to read
df %>% mutate(d=st_distance(!!!syms(attr(., "sf_column")), (!!!syms(attr(., "sf_column")))[row_number()==1]))
#> Simple feature collection with 4 features and 2 fields
#> ...
#> 4 2           (4 4)  1.41
# this works and is already better:
geometry_name<-function(x) syms(attr(x, 'sf_column'))
df %>% mutate(d=st_distance(!!!geometry_name(.), (!!!geometry_name(.))[row_number()==1]))
#> Simple feature collection with 4 features and 2 fields
#> ...  
#> 4 2           (4 4)  1.41

Ideally I would like to find a function that makes the following code work as this would be easiest for users:

df %>% mutate(d=st_distance(geometry_name(), geometry_name()[row_number()==1]))

Solution

  • Calling this kind of function without arguments requires that you assume symbols are present in the calling frame (in this case the . placeholder and the .data pronoun), so it won't work well outside of dplyr verbs, but if that suits your workflow, then you can do:

    geometry_name <- function() {
      .data <- eval(quote(.data), parent.frame())
      nms <- names(eval(quote(.), parent.frame()))
      geo <- which(sapply(nms, function(x) inherits(.data[[x]], 'sfc')))
      if(length(geo) == 0) {
        stop('No geometry column detected')
      }
      if(length(geo) > 1) {
        warning('More than one geometry column. Only the first will be used.')
        geo <- geo[1]
      }
      .data[[nms[geo]]]
    }
    

    Using your example, this allows you to use your specified syntax:

    df %>% 
      mutate(d = st_distance(geometry_name(), geometry_name()[row_number()==1]))
    #> Simple feature collection with 4 features and 2 fields
    #> Geometry type: POINT
    #> Dimension:     XY
    #> Bounding box:  xmin: 1 ymin: 1 xmax: 4 ymax: 4
    #> Projected CRS: WGS 84 / Pseudo-Mercator
    #> # A tibble: 4 x 3
    #> # Groups:   cat [2]
    #>   cat      geometry d[,1]
    #> * <fct> <POINT [m]>   [m]
    #> 1 1           (1 1)  0   
    #> 2 1           (2 2)  1.41
    #> 3 2           (3 3)  0   
    #> 4 2           (4 4)  1.41
    

    You could potentially make the function a bit more useful by allowing it to take a data argument, which if missing runs the above code (after checking for the presence of . and .data), but otherwise just finding and returning the sf column from data. This would allow use outside of dplyr verbs, but preserve the desired behaviour inside dplyr.

    For example:

    geometry_name <- function(data) {
      if(missing(data)) {
        .data <- tryCatch( { 
          eval(quote(.data), parent.frame())
        }, error = function(e){ 
          stop("Argument 'data' missing, with no default")
        })
        plchlder <- tryCatch({
          eval(quote(.), parent.frame())
        }, error = function(e) {
          stop("geometry_name can only be used without a 'data' argument ",
               "inside dplyr verbs")
        })
        nms <- names(plchlder)
        geo <- which(sapply(nms, function(x) inherits(.data[[x]], 'sfc')))
        if(length(geo) == 0) {
          stop('No geometry column detected')
        }
        if(length(geo) > 1) {
          warning('More than one geometry column. Only the first will be used.')
          geo <- geo[1]
        }
        return(.data[[nms[geo]]])
      }
      
      geo <- which(sapply(data, function(x) inherits(x, 'sfc')))
      if(length(geo) == 0) stop('No geometry column detected')
      if(length(geo) > 1) {
        warning('More than one geometry column. Only the first will be used.')
        geo <- geo[1]
      }
      return(data[[geo]])
    }
    

    Which gives the following behaviour

    geometry_name(df)
    #> [1] "geometry"
    
    geometry_name()
    #> Error in value[[3L]](cond) : 
    #>   geometry_name can only be used without a 'data' argument inside 
    #>   dplyr verbs
    
    df %>% 
      mutate(d = st_distance(geometry_name(), geometry_name()[row_number()==1]))
    #> Simple feature collection with 4 features and 2 fields
    #> Geometry type: POINT
    #> Dimension:     XY
    #> Bounding box:  xmin: 1 ymin: 1 xmax: 4 ymax: 4
    #> Projected CRS: WGS 84 / Pseudo-Mercator
    #> # A tibble: 4 x 3
    #> # Groups:   cat [2]
    #>   cat      geometry d[,1]
    #> * <fct> <POINT [m]>   [m]
    #> 1 1           (1 1)  0   
    #> 2 1           (2 2)  1.41
    #> 3 2           (3 3)  0   
    #> 4 2           (4 4)  1.41