I have a 640 x 2500 dataframe with numeric values and several NA
values. My goal is to find a minimum of 75 consecutive NA
values in each row. For each such run, I want to replace the previous and following 50 cells with NA
values too.
Here's a scaled down example of one row:
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
# run of four NA: ^ ^ ^ ^
I want to detect the run of four consecutive NA
, and then replace three values before and three values after the run with NA
:
c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
# ^ ^ ^ ^ ^ ^
I have tried to first identify the consecutive NA
s with rle
, but running rle(is.na(df))
gives the error 'x' must be a vector of an atomic type
. This occurs even when I select a single row.
Unfortunately, I do not know what the next steps to take would be in converting the previous and following 50 cells to NA.
Would highly appreciate any help on this, thanks in advance.
Because you comment that in your data "some [rows] begin and end with several NA
s", hopefully this better represents the real data:
A B C D E F G H I J
1 1 2 3 NA NA 6 7 8 NA 10
2 1 NA NA NA 5 6 7 NA NA NA
3 1 2 3 4 NA NA NA 8 9 10
Let's assume that the minimum run length of NA
to be expanded with NA
is 2, and that two values before and two values after the run should be replaced with NA
. In this example, row 2 would represent the case you mentioned in comment.
First some data wrangling. I prefer to work with a data.table
in long format. With data.table
we have access to the useful constants .I
and .N
, and can easily create run IDs with rleid
.
# convert data.frame to data.table
library(data.table)
setDT(d)
# set minimum length of runs to be expanded
len = 2L
# set number of values to replace on each side of run
n = 2L
# number of columns of original data (for truncation of indices)
nc = ncol(d)
# create a row index to keep track of the original rows in the long format
d[ , ri := 1:.N]
# melt from wide to long format
d2 = melt(d, id.vars = "ri")
# order by row index
setorder(d2, ri)
Now the actual calculations on the runs and their indices:
d2[
# check if the run is an "NA run" and has sufficient length
d2[ , if(anyNA(value) & .N >= len){
# get indices before and after run, where values should be changed to NA
ix = c(.I[1] - n:1L, .I[.N] + 1L:n)
# truncate indices to keep them within (original) rows
ix[ix >= 1 + (ri - 1) * nc & ix <= nc * ri]},
# perform the calculation by row index and run index
# grab replacement indices
by = .(ri, rleid(is.na(value)))]$V1,
# at replacement indices, set value to NA
value := NA]
If desired, cast back to wide format
dcast(d2, ri ~ variable, value.vars = "value")
# ri A B C D E F G H I J
# 1: 1 1 NA NA NA NA NA NA 8 NA 10
# 2: 2 NA NA NA NA NA NA NA NA NA NA
# 3: 3 1 2 NA NA NA NA NA NA NA 10