Search code examples
rfilteringpanel

Filtering for only complete sets of years


I have data on yield organized by State and County. Out of this data I want to retain only those counties providing complete years between 1970 to 2000.

The following code clears away some incomplete cases, however fails to omit all the cases- especially with a larger data set. The fake data

Some fake data:

fake data

K <- 5 # number of rows set to NaN

df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
                 county = rep(1:4, 5), yield = 100)

df[sample(1:20, K), 3] <- NaN

Current code:

df1 <- read.csv("gly2.csv",header=TRUE)

df <- data.frame(df1)


droprows_1 <- function(df, v1, v2, v3, value = 'x'){
  idx <- df[, v3] == value
  todrop <- df[idx, c(v1, v2)]; todrop # should have K rows missng
  todrop <- unique(todrop); todrop # but unique values could be less

  nrow <- dim(todrop)[1]
  for(i in 1:nrow){
    idx <- apply(df, 1, function(x) all(x == todrop[i, ]))
    df <- df[!idx, ]
  }
  return(df)
}

qq <- droprows_1(df, 1, 2, 3)

thank you


Solution

  • To drop county's with a single missing value, use:

    library(dplyr)
    df %>% group_by(county) %>% filter( !any(is.nan(yield)))