Search code examples
rdata.tablerowsnasubtraction

Subtract rows with numeric values and ignore NAs


I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.

I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.

I found very helpful posts with data.table solutions for a single column Iterate over a column ignoring but retaining NA values in R, and for multiple column operations (e.g. Summarizing multiple columns with dplyr?).

However, I haven't managed to combine the approaches suggested in SO (i.e. apply diff over multiple columns and ignore the NAs)

Here's an example df for illustration and a solution I tried:

library(data.table)

df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))

that's how it works for a single column

 diff_x <- df[!is.na(x), lag_diff := x - shift(x)]  # actually what I want, but for more columns at once

and that's how I apply a diff function over several columns with lapply

diff_all <- setDT(df)[,lapply(.SD, diff)]  # not exactly what I want because NAs are not ignored and  the difference between numeric values is not calculated 

I'd appreciate any suggestion (base, data.table, dplyr ,... solutions) on how to implement a valid !is.na or similar statement into this second line of code very much.


Solution

  • So you are looking for:

    library("data.table")
    
    df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
    setDT(df)
    # diff_x <- df[!is.na(x), lag_diff := x - shift(x)]  # actually what I want, but
    
    lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
    df[, lapply(.SD, lag_d)]
    

    or

    library("data.table")
    
    df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
    lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
    as.data.frame(lapply(df, lag_d))