
Weighted count of technology classes per country and per year


I am trying to compute a weighted count of technology classes per country and per year. I am starting from a data frame like this:

library(dplyr)  
df <- tibble(
 id = c("01", "01", "02", "02", "02", "02", "03"), 
 year = c("1975", "1975", "1976", "1976", "1976", "1976", "1980"),
 country = c("US", "CA", "DE", "DE", "FR", "FR", "IT"),
 uspc_class = c("A", "A", "B", "C", "B", "C", "D"),
 fractional_count = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1))  # numeric, so the weights can be summed

where id identifies the patent that each uspc_class belongs to; a patent can be produced by one or more countries.
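The fractional_count values in this example are consistent with splitting each patent equally among its countries. A minimal sketch of that reading follows; the equal-split rule is an assumption (the question does not say how the weights were computed), and `patents` and `weight_check` are hypothetical names used only for illustration.

```r
library(dplyr)

# Same example rows as above, without the weight column.
patents <- tibble(
  id = c("01", "01", "02", "02", "02", "02", "03"),
  country = c("US", "CA", "DE", "DE", "FR", "FR", "IT"),
  uspc_class = c("A", "A", "B", "C", "B", "C", "D")
)

# Assumption: each patent is split equally among its distinct countries,
# so the weight is 1 / (number of distinct countries for that id).
weight_check <- patents %>%
  group_by(id) %>%
  mutate(fractional_count = 1 / n_distinct(country)) %>%
  ungroup()
```

Under this assumption, the rows of a patent with two countries each get a weight of 0.5, and a single-country patent gets a weight of 1.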

I want to make a count for each uspc_class to see how many are attributable to each country in each year.

I am able to make the normal count with the following code:

df_count <- df %>%
  group_by(uspc_class, country, year) %>%
  dplyr::summarise(cc_ijt = n()) %>%
  ungroup()

and I get the count in the cc_ijt variable of the df_count data frame. However, because some ids are associated with multiple countries, I would like to take this into account to avoid double counting.

That is, the result I get with my code is a dataframe like this:

df_count <- tibble(
  uspc_class = c("A", "A", "B", "B", "C", "C", "D"), 
  country = c("CA", "US", "DE", "FR", "DE",  "FR", "IT"),
  year = c("1975", "1975", "1976", "1976", "1976", "1976", "1980"),
  cc_ijt = c(1L, 1L, 1L, 1L, 1L, 1L, 1L))  # n() returns integer counts

What I would like to get instead is something like this:

df_count <- tibble(
  uspc_class = c("A", "A", "B", "B", "C", "C", "D"), 
  country = c("CA", "US", "DE", "FR", "DE",  "FR", "IT"),
  year = c("1975", "1975", "1976", "1976", "1976", "1976", "1980"),
  cc_ijt = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1))

where cc_ijt weights each occurrence of a uspc_class by its fractional_count.

How can I modify my code to do this? Thank you!


Solution

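  • Instead of counting how many uspc_class rows are attributable to each country in each year, I summed the fractional_count column within each group, which solved the issue. If fractional_count is stored as character, it has to be converted with as.numeric() before summing; otherwise sum() raises an error.

    This code did it:

    df_count <- df %>%
      group_by(uspc_class, country, year) %>%
      # as.numeric() guards against fractional_count being stored as character
      dplyr::summarise(cc_ijt = sum(as.numeric(fractional_count))) %>%
      ungroup()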
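As a possible alternative to the group_by() and summarise() approach above, dplyr's count() accepts a `wt` argument that sums a weight column per group. This is a sketch, not the method from the original answer, and it assumes fractional_count is numeric; the example data is redefined here so the block runs on its own.

```r
library(dplyr)

# Example data from the question, with numeric weights (assumption).
df <- tibble(
  id = c("01", "01", "02", "02", "02", "02", "03"),
  year = c("1975", "1975", "1976", "1976", "1976", "1976", "1980"),
  country = c("US", "CA", "DE", "DE", "FR", "FR", "IT"),
  uspc_class = c("A", "A", "B", "C", "B", "C", "D"),
  fractional_count = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1)
)

# With wt supplied, count() sums the weights in each group instead of
# counting rows; name sets the output column. The result is ungrouped.
df_count <- df %>%
  count(uspc_class, country, year, wt = fractional_count, name = "cc_ijt")
```

Because each patent's weights add up to one per class, the cc_ijt values for a given uspc_class and year sum to the number of patents in that class, which avoids the double counting described in the question.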