Tags: r, cluster-analysis, dummy-variable, dummy-data

How to create automated ranges for a dummy variable in R?


I have the following data frame, and I want to create a dummy variable with an automatically calculated scale that indicates whether a city has few, a medium number of, or many companies.

cities    sum of companies
CITY A    199
CITY B    358
CITY C    250
CITY D    1265
CITY E    610

I tried the following code:

    # install.packages("scales")
    library(scales)

    COMP_SCALES <- breaks_extended()  # from the scales package
    # df[[2]] extracts the numeric column; df[2] is a one-column data frame, which cut() rejects
    COMP_A <- COMP_SCALES(df[[2]], n = 4)
    COMP_A <- cut(df[[2]],
                  breaks = c(-Inf, COMP_A[2], COMP_A[3], COMP_A[4], Inf),
                  labels = c("LITTLE", "MEDIUM", "A LOT OF", "+ A LOT OF"))

However, the automatically calculated scale is not suitable, since all the cities end up in the "little" range. How can I automate this better?

The final purpose is to create a table that makes the result easier to visualize, something like this:

COMP_A_CLUSTER <- as.data.frame.matrix(table(COMP_A,kmeans.k$cluster))

Expected outcome: City A should be placed in "little", Cities B and C in "medium", City E in "a lot of", and City D in "+ a lot of".

I have more than 10,000 cities and more than 100 columns that need the same treatment, which is why I want the scale of the dummies to be calculated automatically.


Solution

  • You can write your own function if you know the end (right) boundary of each category. Below is a simple example. It adds a new column 'CatCities' to DF that holds the category you are looking for.

    The example makes the following assumptions:

    • The lowest value of sum.of.companies is greater than or equal to 0
    • The highest value of sum.of.companies is 10000 (you can change it)
    • 'CategoryList' is ordered from the lowest category to the highest, and 'EndPoints' is strictly increasing
    • The vectors passed as 'CategoryList' and 'EndPoints' have the same length
    DF <- read.csv("./SomeDF.csv")
    # Return the first category whose right end point is >= x
    ClassifyRange <- function(x,
                              CategoryList = c("Little", "Medium", "a lot of", "+a lot of"),
                              EndPoints = c(250, 500, 1000, 10000)) {
      Index <- which(EndPoints >= x)
      CategoryList[Index[1]]
    }

    # vapply returns a character vector; lapply would store a list column
    DF$CatCities <- vapply(DF$sum.of.companies, ClassifyRange, character(1))
    

    It produces the following output:

    [Image: output of the function; the rightmost column is the category of each city]
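
If the end points are not known in advance, one possible way to automate them is to take the break points from each column's own quantiles. The following is only a sketch under that assumption, not part of the original answer: `classify_by_quantile` is a hypothetical helper name, the labels are the ones from the question, and the sample data frame is rebuilt from the question's table with the column name used in the answer above.

```r
# Hypothetical helper: bin a numeric vector into ordered categories,
# using its own quantiles as break points (one bin per label).
classify_by_quantile <- function(x,
                                 labels = c("LITTLE", "MEDIUM", "A LOT OF", "+ A LOT OF")) {
  probs <- seq(0, 1, length.out = length(labels) + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  if (length(breaks) != length(labels) + 1) {
    stop("Quantiles are not distinct; use fewer categories for this column.")
  }
  # include.lowest = TRUE keeps the minimum value inside the first bin
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Sample data from the question (column name assumed to match the answer).
DF <- data.frame(
  cities = c("CITY A", "CITY B", "CITY C", "CITY D", "CITY E"),
  sum.of.companies = c(199, 358, 250, 1265, 610)
)
DF$CatCities <- classify_by_quantile(DF$sum.of.companies)

# Apply the same binning to every numeric column, stored as "<name>_cat".
num_cols <- names(DF)[vapply(DF, is.numeric, logical(1))]
DF[paste0(num_cols, "_cat")] <- lapply(DF[num_cols], classify_by_quantile)
```

On the five sample cities this places City A and City C in "LITTLE", City B in "MEDIUM", City E in "A LOT OF", and City D in "+ A LOT OF". That differs from the expected outcome for City C, because quantile bins hold roughly equal numbers of cities rather than following hand-picked thresholds. The resulting factor can be passed to `table()` in the same way as `COMP_A` in the question.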