I have two data frames with spectral bands from a satellite, redDF
and nirDF
. Both data frames have values per date column starting with an 'X', these names correspond in both data frames.
I want to get a new data frame where for each column starting with an 'X' in both redDF
and nirDF
a new value is calculated according to some formula.
Here is a data sample:
library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
as.Date("2015-09-20", "%Y-%m-%d"),
by="16 days") %>%
format(., "%Y-%m-%d") %>%
paste0("X", .)
# sample data values
mydata <- as.integer(runif(length(datecolnames))*1000)
# sample no data indices
nodata <- sample(1:length(datecolnames), length(datecolnames)*0.3)
mydata[nodata] <- NA # assign no data to the correct indices
# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
mydata[sample(1:length(mydata))],
mydata[sample(1:length(mydata))]) %>%
t() %>%
as.data.frame(., row.names = FALSE) %>%
rename_with(~datecolnames) %>%
mutate(id = row_number()+1142) %>%
select(id, everything())
# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>%
mutate(across(-id,~as.integer(.x+20*1.8))) %>%
select(id, everything())
> nirDF
id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143 NA 645 NA 636 569 841 706
2 1144 1025 NA 706 569 354 NA NA
3 1145 904 636 706 645 NA NA 115
X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1 115 1025 904 NA 409 354
2 115 636 409 645 841 904
3 569 409 354 841 1025 NA
and this is the formula:
getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)}
I hoped I would be able to do something like:
ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))
But that doesn't work, as dplyr
doesn't know what the nir
argument of getNDVI
should be. I have seen solutions for accessing other data frames in mutate()
by using the $COLNAME
indexer, but since I have 197 columns, that is not an option here.
In its most basic form, you can just do this:
round((nirDF - redDF)/(nirDF + redDF), digits = 4)
But this does not retain the id-column and can break if some columns are not numeric. A more failsafe version would be:
red <- redDF %>%
arrange(id) %>% # be sure to apply the same order everywhere
select(starts_with('X')) %>%
mutate(across(everything(), as.numeric)) # be sure to have numeric columns
nir <- nirDF %>% arrange(id) %>%
select(starts_with('X')) %>%
mutate(across(everything(), as.numeric))
# make sure that the number of rows are equal
if(nrow(red) == nrow(nir)){
ndvi <- redDF %>%
# get data.frame with ndvi values
transmute(round((nir - red)/(nir + red), digits = 4)) %>%
# bind id-column and possibly other columns to the data frame
bind_cols(redDF %>% arrange(id) %>% select(!starts_with('X'))) %>%
# place the id-column to the front
select(!starts_with('X'), everything())
}
As far as I have understood dplyr
by now, it boils down to this:
across
is (generally) meant for many-to-many relationships, but handles columns on an individual basis by default. So, if you give it three columns, it will give you three columns back which are not aware of the values in other columns.c_across
on the other hand, can evaluate relationships between columns (like a sum or a standard deviation) but is meant for many-to-one relationships. In other words, if you give it three columns, it will give you one column back.Neither of these is suitable for this task. However, by design, arithmetic operations can be applied to data frames in R (just try cars*cars
for instance). This is what we need in this case. Luckily, these operations are not as greedy as dplyr join operations, so they can be done efficiently on large data frames.
While doing so, you need to keep some requirements into account:
numeric
or integer
).