Change Numerical Data into Categorized Data


Here is the link to the dataset. I am trying to categorize my numeric data. Recoding DR_AGE works fine.

setwd("~/data1")
a2 <- read.csv("data1.csv")
dim(a2)
[1] 11503     7
names(a2)
[1] "CR_HOUR"  "adt"      "ln"       "pav"      "DR_AGE"   "NUM_OCC"  "VEH_YEAR"

## categorize DR_AGE

a2$DR_AGE[a2$DR_AGE < 25] <- "15-24"
a2$DR_AGE[a2$DR_AGE>24 & a2$DR_AGE < 35] <- "25-34"
a2$DR_AGE[a2$DR_AGE >34 & a2$DR_AGE < 45] <- "35-44"
a2$DR_AGE[a2$DR_AGE >44 & a2$DR_AGE < 55] <- "45-54"
a2$DR_AGE[a2$DR_AGE >54 & a2$DR_AGE < 65] <- "55-64"
a2$DR_AGE[a2$DR_AGE >64 & a2$DR_AGE < 75] <- "65-74"
a2$DR_AGE[a2$DR_AGE >74 ] <- "75 plus"
a2$DR_AGE <- factor(a2$DR_AGE)
table(a2[, "DR_AGE"])                 ## All categories are generated. 
  15-24   25-34   35-44   45-54   55-64   65-74 75 plus 
   2298    2118    1638    1526    1036     511     350 

But something goes wrong when I try to categorize CR_HOUR or VEH_YEAR.

## categorize CR_HOUR  
a2$CR_HOUR[a2$CR_HOUR < 7] <- "00-06"
a2$CR_HOUR[a2$CR_HOUR>6 & a2$CR_HOUR < 13] <- "07-12"
a2$CR_HOUR[a2$CR_HOUR >12 & a2$CR_HOUR < 19] <- "13-18"
a2$CR_HOUR[a2$CR_HOUR >18 ] <- "19-24"
a2$CR_HOUR <- factor(a2$CR_HOUR)
table(a2[, "CR_HOUR"])              ### "07-12" is not generated. ????

00-06    10    11    12 13-18 19-24 
 1234   303   338   378  4152  5096 

## categorize VEH_YEAR
a2$VEH_YEAR[a2$VEH_YEAR >1930 & a2$VEH_YEAR <1991] <- "1990 and Before"
a2$VEH_YEAR[a2$VEH_YEAR>1990 & a2$VEH_YEAR < 2001] <- "1991-2000"
a2$VEH_YEAR[a2$VEH_YEAR>2000 & a2$VEH_YEAR < 2011] <- "2001-2010"
a2$VEH_YEAR[a2$VEH_YEAR >2010] <- "2011 and After"
a2$VEH_YEAR<- factor(a2$VEH_YEAR)
table(a2[, "VEH_YEAR"])              ### "1990 and Before" is not generated. ????

     1991-2000      2001-2010 2011 and After 
          4842           4763             57 

I am struggling to fix the problem. Any help is appreciated.


Solution

  • The problem is that when you do

    a2$CR_HOUR[a2$CR_HOUR < 7] <- "00-06"
    

    you are assigning a character value to a numeric column. This converts the whole CR_HOUR column to character, so the later comparisons are made on strings rather than numbers (for example, "10" sorts before "6"). Rows then end up in the wrong category or are never recoded at all. This is not a reliable way to recode data. It is better to build a new character vector for the category names and then add it to the data.frame, or replace the original column once all the substitutions are done.
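
    To make the coercion concrete, here is a minimal sketch on a made-up vector of hours (the names `hr`, `bad`, `lab` and `hour_cat` and the values are illustrative assumptions, not taken from the dataset). The first half reproduces the type change, and the second half fills a separate character vector so that every comparison still runs against the numeric source.

```r
# Toy data: hypothetical crash hours, not from the original dataset
hr <- c(3, 7, 10, 12, 15, 20)

# First step of the original recode: assigning a string coerces the whole vector
bad <- hr
bad[bad < 7] <- "00-06"
class(bad)                 # "character"

# Comparisons against a character vector are done on strings
"10" > 6                   # compares "10" with "6"; FALSE in typical locales
# The same effect likely explains the lost VEH_YEAR label: both parts are
# TRUE in typical locales, so the label itself matches the next condition
"1990 and Before" > 1990 & "1990 and Before" < 2001

# Safer: leave the numeric source untouched and fill a separate vector
lab <- rep(NA_character_, length(hr))
lab[hr < 7]             <- "00-06"
lab[hr >= 7  & hr < 13] <- "07-12"
lab[hr >= 13 & hr < 19] <- "13-18"
lab[hr >= 19]           <- "19-24"
hour_cat <- factor(lab, levels = c("00-06", "07-12", "13-18", "19-24"))
table(hour_cat)
```

    Because `hr` never changes type, the order of the assignments no longer matters, and the explicit `levels` keep the categories in a sensible order in tables and plots.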

    If you have ranges like this, cut() can be very useful. Run it on the original numeric columns, for example:

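    agebr <- c(14,24,34,44,54,64,74,Inf)
    # intervals are right-closed: (14,24], (24,34], ..., (74,Inf]
    # the last label is written out so it reads "75 plus" rather than "75-Inf"
    a2$DR_AGE <- cut(a2$DR_AGE, breaks=agebr,
        labels=c(paste(head(agebr,-2)+1, head(tail(agebr,-1),-1), sep="-"), "75 plus"))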
    table(a2$DR_AGE)
    
    hourbr <- c(0,6,12,18,24)
    # include.lowest=TRUE keeps hour 0 in the first interval, [0,6]
    a2$CR_HOUR <- cut(a2$CR_HOUR, breaks=hourbr,
         labels=paste(sprintf("%02d", ifelse(head(hourbr,-1)>0,head(hourbr,-1)+1,0)),
         sprintf("%02d",tail(hourbr,-1)), sep="-"), include.lowest=TRUE)
    table(a2$CR_HOUR)
    
    a2$VEH_YEAR <- cut(a2$VEH_YEAR, breaks=c(0,1990,2000,2010,Inf), 
        labels=c("1990 and Before","1991-2000","2001-2010","2011 and After"))
    table(a2$VEH_YEAR)
    

    The label-building code is a bit messy because I tried to reproduce your labels exactly, but cut() itself is very easy to use.
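
    As a quick check of how cut() treats the interval boundaries, the sketch below uses made-up ages that sit exactly on the breaks (the vector `age` and the result names are illustrative assumptions). It relies only on the documented defaults of cut(): intervals are right-closed, and a value equal to the lowest break is dropped unless `include.lowest = TRUE`.

```r
# Toy ages chosen to sit on the interval boundaries (illustrative values only)
age   <- c(15, 24, 25, 74, 75, 90)
agebr <- c(14, 24, 34, 44, 54, 64, 74, Inf)

# Default right = TRUE gives (14,24], (24,34], ..., (74,Inf],
# so 24 lands in the first bin and 25 in the second
age_cat <- cut(age, breaks = agebr,
               labels = c("15-24", "25-34", "35-44", "45-54",
                          "55-64", "65-74", "75 plus"))
as.character(age_cat)

# A value equal to the lowest break becomes NA unless include.lowest = TRUE
hourbr     <- c(0, 6, 12, 18, 24)
lo_default <- cut(0, breaks = hourbr)                         # NA
lo_incl    <- cut(0, breaks = hourbr, include.lowest = TRUE)  # [0,6]
```

    Values outside every interval also come back as NA, so comparing `sum(is.na(...))` before and after the recode is a cheap way to spot rows that fell through the breaks.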