I have data on yield organized by State and County. Out of this data I want to retain only those counties providing complete years between 1970 to 2000.
The following code clears away some incomplete cases, however fails to omit all the cases- especially with a larger data set. The fake data
Some fake data:
K <- 5 # number of rows set to NaN
df <- data.frame(state = c(rep(1, 10), rep(2, 10)),
county = rep(1:4, 5), yield = 100)
df[sample(1:20, K), 3] <- NaN
Current code:
df1 <- read.csv("gly2.csv",header=TRUE)
df <- data.frame(df1)
droprows_1 <- function(df, v1, v2, v3, value = 'x'){
idx <- df[, v3] == value
todrop <- df[idx, c(v1, v2)]; todrop # should have K rows missng
todrop <- unique(todrop); todrop # but unique values could be less
nrow <- dim(todrop)[1]
for(i in 1:nrow){
idx <- apply(df, 1, function(x) all(x == todrop[i, ]))
df <- df[!idx, ]
}
return(df)
}
qq <- droprows_1(df, 1, 2, 3)
thank you
To drop county's with a single missing value, use:
library(dplyr)
df %>% group_by(county) %>% filter( !any(is.nan(yield)))