R apply multiple functions when large number of categories/types are present using case_when (R vectorization)


Suppose I have a dataset of the following form:

City=c(1,2,2,1)
Business=c(2,1,1,2)
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))

My actual dataset contains about 200K rows. Furthermore, it contains information for over 100 cities. Suppose, for each city (which I also call "Type"), I have the following functions which need to be applied:

#Writing the custom functions for the categories here

Type1=function(full_data,observation){
  NewSet=full_data[which(!full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
  return(BusinessMax)
}

Type2=function(full_data,observation){
  NewSet=full_data[which(!full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
  return(BusinessMax)
}

Once again, the two functions above are extremely simple ones that I use for illustration. The idea is that for each City (or "Type") I need to run a different function on each row of my dataset. I used rnorm in both functions to check that a different value is drawn for each row.

Now, for the entire dataset, I want to first split the observations by City (or "Type"). I can do this using (zz_new[["City"]]==1) [also see below]. Then I want to run the respective function for each class. However, when I run the code below, I get -Inf.

Can someone help me understand why this is happening?

For the example data, I would expect to obtain 20 plus 10 times some random value (for Type = 1) and 35 minus 100 times some random value (for Type = 2). The values should also differ from row to row, since I am drawing them from a random normal distribution.

library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
  zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
  zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)

Thanks a lot in advance.


Solution

  • Let's take a look at your code. I would rewrite

    library(dplyr)
    zz_new[,"AdjustedRevenue"] = case_when(
      zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
      zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
    )
    

    to

    zz_new %>%
      mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new,zz_new),
                                         City == 2 ~ Type2(zz_new,zz_new)))
    

    since you are using dplyr but don't use the powerful tools provided by this package.

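    Besides using mutate, the key change is that I replaced zz_new[,] with zz_new. With both indices left empty, zz_new[,] selects every row and every column, so the two are equivalent. Written this way, it is clear that both arguments of your Type functions are the same data frame.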

    Next step: Take a look at your function

    Type1 <- function(full_data,observation){
      NewSet=full_data[which(!full_data$City==observation$City),]
      BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
      return(BusinessMax)
    }
    

    which is called by Type1(zz_new,zz_new). So the definition of NewSet gives us

    NewSet=full_data[which(!full_data$City==observation$City),]
    
    # replace the arguments
    NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
    

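    The comparison zz_new$City==zz_new$City checks the column against itself element by element, so it is TRUE everywhere and its negation is FALSE everywhere. Thus NewSet is always a data frame with zero rows. Applying max to an empty column of a data.frame yields -Inf (with a warning). Note also that case_when does not call Type1 or Type2 once per row: each right-hand side is evaluated a single time on the whole data frame, and the single value it returns is recycled across all matching rows.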
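
    The diagnosis above can be reproduced in a few lines, followed by one possible way to get a per-row result. This is only a sketch in base R under some assumptions: the names keep, empty_max, cities, max_by_city, base_max and noise are made up for illustration, and it relies on the fact that, in the example functions, NewSet depends only on the row's City. Real per-city functions may need a different approach.

```r
# Rebuild the example data from the question.
City <- c(1, 2, 2, 1)
Business <- c(2, 1, 1, 2)
ExpectedRevenue <- c(35, 20, 15, 19)
zz <- data.frame(City, Business, ExpectedRevenue)
zz_new <- do.call("rbind", replicate(zz, n = 30, simplify = FALSE))

# Reproduce the problem: comparing the City column with itself keeps no rows,
# and max() of an empty vector is -Inf (suppressWarnings hides the warning).
keep <- which(!zz_new$City == zz_new$City)
empty_max <- suppressWarnings(max(zz_new[keep, ]$ExpectedRevenue))

# Sketch of a fix (assumption: the subset depends only on the row's City).
# Compute the maximum revenue among the *other* cities once per city ...
cities <- unique(zz_new$City)
max_by_city <- vapply(
  cities,
  function(ct) max(zz_new$ExpectedRevenue[zz_new$City != ct]),
  numeric(1)
)

# ... then look it up for every row and draw one random value per row.
base_max <- max_by_city[match(zz_new$City, cities)]
set.seed(1)  # only to make the illustration reproducible
noise <- rnorm(nrow(zz_new))

# Apply the city-specific adjustment to the matching rows only.
is1 <- zz_new$City == 1
is2 <- zz_new$City == 2
adjusted <- rep(NA_real_, nrow(zz_new))
adjusted[is1] <- base_max[is1] + 10 * noise[is1]
adjusted[is2] <- base_max[is2] - 100 * noise[is2]
zz_new$AdjustedRevenue <- adjusted
```

    Because base_max and noise have one element per row, the same expressions should also work as right-hand sides of case_when inside mutate, for example City == 1 ~ base_max + 10 * noise. Computing the maximum once per city rather than once per row keeps the work manageable when the data has 200K rows and over 100 cities.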