Search code examples
mysqlrfinance

Conditional sum on data.frame based on duplicates


I have been trying to make a conditional sum based on a data.framethat has duplicates. I want to sum the ones that has an identical permno and date and create a separate column with this information filling in NA's or preferable 0's.

My data set looks like this:

data.frame(crsp)

    permno     date    PAYDT DISTCD divamt FACPR FACSHR   PRC       RET
1   10022 19280929 19281001   1272   0.25     0      0 71.00  0.045208
2   10022 19280929 19281001   1232   1.00     0      0 71.00  0.045208
3   10022 19281031       NA     NA     NA    NA     NA 73.50  0.035211
4   10022 19281130       NA     NA     NA    NA     NA 72.50 -0.013605
5   10022 19281231 19290202   1232   1.00     0      0 68.00 -0.044828
6   10022 19281231 19290202   1272   0.25     0      0 68.00 -0.044828
7   10022 19290131       NA     NA     NA    NA     NA 73.75  0.084559
8   10022 19290228       NA     NA     NA    NA     NA 69.00 -0.064407
9   10022 19290328 19290401   1232   1.00     0      0 65.00 -0.039855
10  10022 19290328 19290401   1272   0.25     0      0 65.00 -0.039855
11  10022 19290430       NA     NA     NA    NA     NA 67.00  0.030769
12  10022 19290531       NA     NA     NA    NA     NA 64.75 -0.033582

First, I have created permno + date to make a unique pickup-code

crsp$permnodate = paste(as.character(crsp$permno),as.character(crsp$date),sep="") 

Second, I have then tried to sum the duplicates and making this into a new frame:

crsp_divsingl <- aggregate(crsp$divamt, by = list(permnodate = crsp$permnodate), FUN = sum, na.rm = TRUE)

However, I am unable to transfer this information back correctly to the original data.frame(crsp), as the columns have different lenghts where cbind and cbind.fill don't allow me to match this correctly. Specifically, I want the sum of the divamts for one/the first of the unique permnodates so it corresponds with the remaining data.frame in length. I have not had succed with merge or match either.

I haven't tried loop functions yet or managed to create any if or ifelse functions with succes. Basically, this can be done in excel with the VLOOKUP or the index.match formula, however, this is more tricky in R than I first thought.

Help is much appreciated.

Best regards

Troels


Solution

  • You can use duplicated and merge to achieve this more easily. I've written an example. You'll have to alter this for your purposes, but hopefully it will put you on the right track:

    # Creating a fake sample dataset.
    set.seed(9)
    permno <- 10022:10071 # Allowing 50 possible permno's. 
    date <- 19280929:19280978 # Allow 50 possible dates.
    value <- c(NA, 1:9) # Allowing NA or a 0 through 9 value.
    
    # Creating fake data frame.
    crsp <- data.frame(permno = sample(permno, 1000, TRUE), date = sample(date, 1000, TRUE), value = sample(value, 1000, TRUE))
    
    # Loading a function that uses duplicated to get both the duplicated rows and the original rows.
    fullDup <- function(x) {
    
      bool <- duplicated(x) | duplicated(x, fromLast = TRUE)
      return(bool)
    
    }
    
    # Getting the duplicated rows.
    crsp.dup <- crsp[fullDup(crsp[, c("permno", "date")]), ] # fullDup returns a boolean of all the rows that were duplicated to another row by permno and date including the first row.
    
    # Now aggregate.
    crsp.dup[is.na(crsp.dup)] <- 0 # Converting NA values to 0.
    crsp.dup <- aggregate(value ~ permno + date, crsp.dup, sum)
    names(crsp.dup)[3] <- "value.dup" # Changing the name of the value column.
    
    # Now merge back in with the original dataset.
    crsp <- merge(crsp, crsp.dup, by = c("permno", "date"), all.x = TRUE)