Here is the link to the dataset. I am trying to categorize my data. Recoding DR_AGE works fine.
setwd("~/data1")
a2 <- read.csv("data1.csv")
dim(a2)
[1] 11503 7
names(a2)
[1] "CR_HOUR" "adt" "ln" "pav" "DR_AGE" "NUM_OCC" "VEH_YEAR"
## categorize DR_AGE
a2$DR_AGE[a2$DR_AGE < 25] <- "15-24"
a2$DR_AGE[a2$DR_AGE>24 & a2$DR_AGE < 35] <- "25-34"
a2$DR_AGE[a2$DR_AGE >34 & a2$DR_AGE < 45] <- "35-44"
a2$DR_AGE[a2$DR_AGE >44 & a2$DR_AGE < 55] <- "45-54"
a2$DR_AGE[a2$DR_AGE >54 & a2$DR_AGE < 65] <- "55-64"
a2$DR_AGE[a2$DR_AGE >64 & a2$DR_AGE < 75] <- "65-74"
a2$DR_AGE[a2$DR_AGE >74 ] <- "75 plus"
a2$DR_AGE <- factor(a2$DR_AGE)
table(a2[, "DR_AGE"]) ## All categories are generated.
15-24 25-34 35-44 45-54 55-64 65-74 75 plus
2298 2118 1638 1526 1036 511 350
But something goes wrong when I try to categorize CR_HOUR or VEH_YEAR.
## categorize CR_HOUR
a2$CR_HOUR[a2$CR_HOUR < 7] <- "00-06"
a2$CR_HOUR[a2$CR_HOUR>6 & a2$CR_HOUR < 13] <- "07-12"
a2$CR_HOUR[a2$CR_HOUR >12 & a2$CR_HOUR < 19] <- "13-18"
a2$CR_HOUR[a2$CR_HOUR >18 ] <- "19-24"
a2$CR_HOUR <- factor(a2$CR_HOUR)
table(a2[, "CR_HOUR"]) ### "07-12" is not generated. ????
00-06 10 11 12 13-18 19-24
1234 303 338 378 4152 5096
## categorize VEH_YEAR
a2$VEH_YEAR[a2$VEH_YEAR >1930 & a2$VEH_YEAR <1991] <- "1990 and Before"
a2$VEH_YEAR[a2$VEH_YEAR>1990 & a2$VEH_YEAR < 2001] <- "1991-2000"
a2$VEH_YEAR[a2$VEH_YEAR>2000 & a2$VEH_YEAR < 2011] <- "2001-2010"
a2$VEH_YEAR[a2$VEH_YEAR >2010] <- "2011 and After"
a2$VEH_YEAR<- factor(a2$VEH_YEAR)
table(a2[, "VEH_YEAR"]) ### "1990 and Before" is not generated. ????
1991-2000 2001-2010 2011 and After
4842 4763 57
I am struggling to fix the problem. Any help is appreciated.
The problem is that when you do
a2$CR_HOUR[a2$CR_HOUR < 7] <- "00-06"
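you are assigning a character value to a numeric column. This changes the data type of CR_HOUR to character, which breaks the downstream comparisons: they become string comparisons, so a2$CR_HOUR > 6 is evaluated as, for example, "10" > "6", which is FALSE. That is why 10, 11 and 12 were never recoded. DR_AGE presumably only worked because the ages left after the first step all have two digits, so string order and numeric order happen to agree.
This is not a reliable way to recode data. It would be better to create a new character vector for the category names and then add it to the data.frame, or replace the current column once all the substitutions have been done.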
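To make the coercion visible, here is a minimal sketch on a toy vector. The values in `hours` and the name `hour_cat` are made up for illustration and are not taken from your data. The second half shows the "new character vector" approach: every test runs against the untouched numeric vector.

```r
# Toy hours (hypothetical values, not from the question's data set)
hours <- c(3, 7, 9, 10, 12, 15, 20)

# Assigning a string into a numeric vector coerces the whole vector
x <- hours
x[x < 7] <- "00-06"
class(x)        # "character"

# From here on, comparisons are string comparisons
"10" > "6"      # FALSE, so 10 never matches "> 6"
"7" < "13"      # FALSE, so 7 never matches "< 13"

# Safer: compare against the numeric vector, write into a new character vector
hour_cat <- rep(NA_character_, length(hours))
hour_cat[hours < 7] <- "00-06"
hour_cat[hours > 6 & hours < 13] <- "07-12"
hour_cat[hours > 12 & hours < 19] <- "13-18"
hour_cat[hours > 18] <- "19-24"
table(hour_cat)
```

Once `hour_cat` is complete, it can be stored with something like `a2$CR_HOUR <- factor(hour_cat)`.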
If you have ranges like this, the cut() function can be very useful. For example:
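## start from a freshly read a2, so the columns are still numeric
## ages: (14,24], (24,34], ..., (74,Inf]; the last label comes out as "75-Inf"
agebr <- c(14, 24, 34, 44, 54, 64, 74, Inf)
a2$DR_AGE <- cut(a2$DR_AGE, breaks = agebr,
    labels = paste(head(agebr, -1) + 1, tail(agebr, -1), sep = "-"))
table(a2$DR_AGE)
## hours: include.lowest = TRUE puts hour 0 into the first interval, [0,6]
hourbr <- c(0, 6, 12, 18, 24)
a2$CR_HOUR <- cut(a2$CR_HOUR, breaks = hourbr,
    labels = paste(sprintf("%02d", ifelse(head(hourbr, -1) > 0, head(hourbr, -1) + 1, 0)),
        sprintf("%02d", tail(hourbr, -1)), sep = "-"), include.lowest = TRUE)
table(a2$CR_HOUR)
## years: everything above 0 and up to 1990 lands in the first category
a2$VEH_YEAR <- cut(a2$VEH_YEAR, breaks = c(0, 1990, 2000, 2010, Inf),
    labels = c("1990 and Before", "1991-2000", "2001-2010", "2011 and After"))
table(a2$VEH_YEAR)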
It's a bit messy because I tried to make the same labels, but the function itself is very easy to use.
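If you want to check where the boundaries fall, a small sketch like the following may help. It runs cut() on boundary hours only; the vector `h` and the names `with_lowest` and `without_lowest` are hypothetical, chosen just for this check.

```r
# Boundary values only (hypothetical test input)
h <- c(0, 6, 7, 12, 13, 18, 19, 24)
hourbr <- c(0, 6, 12, 18, 24)
lab <- c("00-06", "07-12", "13-18", "19-24")

# cut() intervals are right-closed by default: (0,6], (6,12], (12,18], (18,24]
# include.lowest = TRUE also closes the first one on the left: [0,6]
with_lowest <- cut(h, breaks = hourbr, labels = lab, include.lowest = TRUE)
without_lowest <- cut(h, breaks = hourbr, labels = lab)

# One row per input value, so each boundary can be checked by eye
data.frame(h, with_lowest, without_lowest)

is.na(without_lowest[1])   # TRUE: hour 0 falls outside (0,6]
```

Without include.lowest = TRUE, every crash recorded at hour 0 would silently become NA, which is why the CR_HOUR call sets it.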