Search code examples
rdplyracross

Apply function on data.frame with mutate across using the same columns from another data.frame


I have two data frames with spectral bands from a satellite, redDF and nirDF. Both data frames have values per date column starting with an 'X', these names correspond in both data frames. I want to get a new data frame where for each column starting with an 'X' in both redDF and nirDF a new value is calculated according to some formula.

Here is a data sample:

library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
           as.Date("2015-09-20", "%Y-%m-%d"),
           by="16 days") %>% 
  format(., "%Y-%m-%d") %>% 
  paste0("X", .)
# sample data values 
mydata <- as.integer(runif(length(datecolnames))*1000)
# sample no data indices
nodata <- sample(1:length(datecolnames), length(datecolnames)*0.3)
mydata[nodata] <- NA # assign no data to the correct indices

# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
           mydata[sample(1:length(mydata))],
           mydata[sample(1:length(mydata))]) %>% 
  t() %>% 
  as.data.frame(., row.names = FALSE) %>% 
  rename_with(~datecolnames) %>% 
  mutate(id = row_number()+1142) %>% 
  select(id, everything())

# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>% 
  mutate(across(-id,~as.integer(.x+20*1.8))) %>% 
  select(id, everything())

> nirDF
    id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143          NA         645          NA         636         569         841         706
2 1144        1025          NA         706         569         354          NA          NA
3 1145         904         636         706         645          NA          NA         115
  X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1         115        1025         904          NA         409         354
2         115         636         409         645         841         904
3         569         409         354         841        1025          NA

and this is the formula:

getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)} 

I hoped I would be able to do something like:

ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))

But that doesn't work, as dplyr doesn't know what the nir argument of getNDVI should be. I have seen solutions for accessing other data frames in mutate() by using the $COLNAME indexer, but since I have 197 columns, that is not an option here.


Solution

  • In its most basic form, you can just do this:

    round((nirDF - redDF)/(nirDF + redDF), digits = 4)
    

    But this does not retain the id-column and can break if some columns are not numeric. A more failsafe version would be:

    red <- redDF %>% 
      arrange(id) %>%  # be sure to apply the same order everywhere
      select(starts_with('X')) %>%  
      mutate(across(everything(), as.numeric)) # be sure to have numeric columns 
    nir <- nirDF %>% arrange(id) %>% 
      select(starts_with('X')) %>%  
      mutate(across(everything(), as.numeric))
    
    # make sure that the number of rows are equal
    if(nrow(red) == nrow(nir)){
      ndvi <- redDF %>% 
        # get data.frame with ndvi values
        transmute(round((nir - red)/(nir + red), digits = 4)) %>% 
        # bind id-column and possibly other columns to the data frame
        bind_cols(redDF %>% arrange(id) %>% select(!starts_with('X'))) %>% 
        # place the id-column to the front
        select(!starts_with('X'), everything())
    }
    

    As far as I have understood dplyr by now, it boils down to this:

    • across is (generally) meant for many-to-many relationships, but handles columns on an individual basis by default. So, if you give it three columns, it will give you three columns back which are not aware of the values in other columns.
    • c_across on the other hand, can evaluate relationships between columns (like a sum or a standard deviation) but is meant for many-to-one relationships. In other words, if you give it three columns, it will give you one column back.

    Neither of these is suitable for this task. However, by design, arithmetic operations can be applied to data frames in R (just try cars*cars for instance). This is what we need in this case. Luckily, these operations are not as greedy as dplyr join operations, so they can be done efficiently on large data frames. While doing so, you need to keep some requirements into account:

    • The number of rows of the two data frames should be equal, otherwise, the shorter data frame will get recycled.
    • all columns in the data frame need to be of a numeric class (numeric or integer).