Search code examples
rdplyr

Count frequency based on multiple columns


I have multiple columns with VC's funds.

df <- data.frame(Company = c("X", "Y","Z"),
                 Valuation = c("10","20","30"),
                 VC.1 = c("Zeev Ventures","Bedrock Capital","Activant Capital"),
                 VC.2 = c("Bedrock Capital","Activant Capital","Zeev Ventures"))
Company Valuatiion VC.1 VC.2
X 10 Zeev Ventures Bedrock Capital
Y 20 Bedrock Capital Activant Capital
Z 30 Activant Capital Zeev Ventures

I need calcucalte different statistics for each VC fund. For example

VC Total Valuatiion Total Investments
Zeev Ventures 40 2
Bedrock Capital 40 2
Activant Capital 50 2

Solution

  • You can pivot the VC.1 and VC.2 columns and get your results with the usual dplyr functions:

    library(tidyr)
    library(dplyr)
    pivot_longer(df, cols = c("VC.1", "VC.2"), values_to = "VC") %>%
      group_by(VC) %>%
      summarise(TV = sum(as.numeric(Valuation)), TI = n())
    
    # A tibble: 3 × 3
      VC                  TV    TI
      <chr>            <dbl> <int>
    1 Activant Capital    50     2
    2 Bedrock Capital     30     2
    3 Zeev Ventures       40     2
    

    I recommend inspecting the result of the pivot_longer() function call alone for a better understanding of why this pivoting step solves the problem of having identifiers in multiple columns.