Search code examples
rmaxdummy-variable

Maximum values of one column based on multiple other columns


this question is quite specific I guess.

I am looking for the R equivalent of Stata egen function, in particular

egen max BY varlist.

In my data table I have a column with ID

ID <- c(1,1,2,2,3,4,5,6,6)

a column with year of observation

year <- c(2000,2000,2001,2002,2002,2003,2004,2004,2004)

year and ID are not unique, because there can be more observations in one year.

Finally I have a column with names

names <- c("Mark",NA,"John","John",NA,"Sarah","Julia",NA,NA).

First of all I want to generate a dummy that is equal to 1 if name is not NA and I figured

dummy <-  ifelse(!is.na(names),1,0)

Then I want R to return the maximum of the dummy based on ID and year, which in Stata would be

egen MAX = max(dummy), by(ID year)

In practice, I want R to return 0 if names is NA for all rows with the same ID and year. So I would get (1,1,1,1,0,1,1,0,0) and I can proceed to drop rows 5,7,8. Thank you!


Solution

  • Here's an attempt using ave, but this logic will work with any grouping function you might know. Look for NAs by group, see if they are not (!/ Negate) all TRUE:

    ave(is.na(dat[["names"]]), dat[c("ID","year")], FUN=Negate(all))
    #[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
    

    Where dat was:

    dat <- data.frame(ID,year,names, stringsAsFactors=FALSE)