Search code examples
rdata.tableeuclidean-distancerowwise

Euclidean distant for distinct classes of factors iterated by groups


*Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), R gets stuck in a computational block (I'm using a machine with 64gb of RAM). Any other solutions are greatly appreciated!

I have a dataframe of patents consisting of the firms, application years, patent number, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes according to the following formula:

formula

Where Xi represents the number of patents belonging to a specific class in year t, and Yi represents the number of patents belonging to a specific class in the previous year (t-1).

To further illustrate this, consider the following dataset:

df <- data.table(Firm = rep(c(LETTERS[1:2]),each=6), Year = rep(c(1990,1990,1991,1992,1992,1993),2),
                   Patent_Number = sample(184785:194785,12,replace = FALSE),
                   Patent_Class = c(12,5,31,12,31,6,15,15,15,3,3,1))
> df
    Firm Year Patent_Number Patent_Class
 1:    A 1990        192473           12
 2:    A 1990        193702            5
 3:    A 1991        191889           31
 4:    A 1992        193341           12
 5:    A 1992        189512           31
 6:    A 1993        185582            6
 7:    B 1990        190838           15
 8:    B 1990        189322           15
 9:    B 1991        190620           15
10:    B 1992        193443            3
11:    B 1992        189937            3
12:    B 1993        194146            1

Since year 1990 is the beginning year for Firm A, there is no Euclidean distance for that year (NAs should be produced. Moving forward to year 1991, the distinct classses for this year (1991) and the previous year (1990) are 31, 5, and 12. Therefore, the above formula is summed over these three distinct classes (there is three distinc 'i's). So the formula's output will be:

formula

Following the same calculation and reiterating over firms, the final output should be:

> df
    Firm Year Patent_Number Patent_Class   El_Dist
 1:    A 1990        192473           12        NA
 2:    A 1990        193702            5        NA
 3:    A 1991        191889           31 1.2247450
 4:    A 1992        193341           12 0.7071068
 5:    A 1992        189512           31 0.7071068
 6:    A 1993        185582            6 1.2247450
 7:    B 1990        190838           15        NA
 8:    B 1990        189322           15        NA
 9:    B 1991        190620           15 0.5000000
10:    B 1992        193443            3 1.1180340
11:    B 1992        189937            3 1.1180340
12:    B 1993        194146            1 1.1180340

I'm preferably looking for a data.table solution for speed purposes.

Thank you very much in advance for any help.


Solution

  • I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.

    fEl_Dist <- function(X){
      Year <- X[["Year"]]
      PatentClass <- X[["Patent_Class"]]
      sapply(seq_along(Year), function(i){
        j <- which(Year %in% (Year[i] - 1:0))
        tbl <- table(Year[j], PatentClass[j])
        if(NROW(tbl) == 1){
          NA_real_
        } else {
          numer <- sum((tbl[2, ] - tbl[1, ])^2)
          denom <- sum(tbl[2, ]^2)*sum(tbl[1, ]^2)
          sqrt(numer/denom)
        }
      })
    }
      
    setDT(df)[, El_Dist := fEl_Dist(.SD), 
              by = .(Firm),
              .SDcols = c("Year", "Patent_Class")]
    
    head(df)
    #   Firm Year Patent_Number Patent_Class   El_Dist
    #1:    A 1990        190948           12        NA
    #2:    A 1990        186156            5        NA
    #3:    A 1991        190801           31 1.2247449
    #4:    A 1992        185226           12 0.7071068
    #5:    A 1992        185900           31 0.7071068
    #6:    A 1993        186928            6 1.2247449