I have been trying to make a conditional sum based on a data.frame
that has duplicates. I want to sum the ones that has an identical permno and date and create a separate column with this information filling in NA's or preferable 0's.
My data set looks like this:
data.frame(crsp)
permno date PAYDT DISTCD divamt FACPR FACSHR PRC RET
1 10022 19280929 19281001 1272 0.25 0 0 71.00 0.045208
2 10022 19280929 19281001 1232 1.00 0 0 71.00 0.045208
3 10022 19281031 NA NA NA NA NA 73.50 0.035211
4 10022 19281130 NA NA NA NA NA 72.50 -0.013605
5 10022 19281231 19290202 1232 1.00 0 0 68.00 -0.044828
6 10022 19281231 19290202 1272 0.25 0 0 68.00 -0.044828
7 10022 19290131 NA NA NA NA NA 73.75 0.084559
8 10022 19290228 NA NA NA NA NA 69.00 -0.064407
9 10022 19290328 19290401 1232 1.00 0 0 65.00 -0.039855
10 10022 19290328 19290401 1272 0.25 0 0 65.00 -0.039855
11 10022 19290430 NA NA NA NA NA 67.00 0.030769
12 10022 19290531 NA NA NA NA NA 64.75 -0.033582
First, I have created permno + date to make a unique pickup-code
crsp$permnodate = paste(as.character(crsp$permno),as.character(crsp$date),sep="")
Second, I have then tried to sum the duplicates and making this into a new frame:
crsp_divsingl <- aggregate(crsp$divamt, by = list(permnodate = crsp$permnodate), FUN = sum, na.rm = TRUE)
However, I am unable to transfer this information back correctly to the original data.frame(crsp)
, as the columns have different lenghts where cbind
and cbind.fill
don't allow me to match this correctly. Specifically, I want the sum of the divamts for one/the first of the unique permnodates so it corresponds with the remaining data.frame
in length. I have not had succed with merge
or match
either.
I haven't tried loop functions yet or managed to create any if
or ifelse
functions with succes. Basically, this can be done in excel with the VLOOKUP or the index.match formula, however, this is more tricky in R than I first thought.
Help is much appreciated.
Best regards
Troels
You can use duplicated
and merge
to achieve this more easily. I've written an example. You'll have to alter this for your purposes, but hopefully it will put you on the right track:
# Creating a fake sample dataset.
set.seed(9)
permno <- 10022:10071 # Allowing 50 possible permno's.
date <- 19280929:19280978 # Allow 50 possible dates.
value <- c(NA, 1:9) # Allowing NA or a 0 through 9 value.
# Creating fake data frame.
crsp <- data.frame(permno = sample(permno, 1000, TRUE), date = sample(date, 1000, TRUE), value = sample(value, 1000, TRUE))
# Loading a function that uses duplicated to get both the duplicated rows and the original rows.
fullDup <- function(x) {
bool <- duplicated(x) | duplicated(x, fromLast = TRUE)
return(bool)
}
# Getting the duplicated rows.
crsp.dup <- crsp[fullDup(crsp[, c("permno", "date")]), ] # fullDup returns a boolean of all the rows that were duplicated to another row by permno and date including the first row.
# Now aggregate.
crsp.dup[is.na(crsp.dup)] <- 0 # Converting NA values to 0.
crsp.dup <- aggregate(value ~ permno + date, crsp.dup, sum)
names(crsp.dup)[3] <- "value.dup" # Changing the name of the value column.
# Now merge back in with the original dataset.
crsp <- merge(crsp, crsp.dup, by = c("permno", "date"), all.x = TRUE)