
R: factor levels, recode rest to 'other'


I use factors somewhat infrequently and generally find them comprehensible, but I am often fuzzy about the details of specific operations. Currently, I am collapsing categories with few observations into "other" and am looking for a quick way to do that: I have perhaps 20 levels of a variable, but I want to collapse a bunch of them into one.

# 500 employee counts; the 100 sampled codes are recycled to fill the 500 rows
data <- data.frame(employees = sample.int(1000, 500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340',
                                    '621391','621399','621410','621420','621491','621492','621493',
                                    '621498','621511','621512','621610','621910','621991','621999'),
                                  100, replace = TRUE))

Here are my levels of interest, and their labels in separate vectors.

#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
         '621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')

I could call factor() and enumerate every level, assigning "other" to each category that has few observations.

Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are correctly coded and everything else is recoded as "other"?


Solution

  • I think the easiest way is to relabel every naics value that is not in top8 to a special sentinel value.

    data$naics[!(data$naics %in% top8)] = -99
    

    Then you can use the "exclude" argument when turning it into a factor. Note that the excluded values end up as NA rather than as a separate "other" level:

    factor(data$naics, exclude=-99)
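
If the goal is an explicit "Other" level with the descriptive labels attached, rather than NA, something like the following may work. It is only a minimal base-R sketch: `top_codes` and `top_labels` are shortened, hypothetical stand-ins for `top8` and `top8_desc`, the toy data is made up for illustration, and `naics_f` is a new column name, not something from the question.

```r
# Toy data in the same shape as the question; codes are kept as character.
set.seed(1)
codes <- c('621111', '621112', '621210', '621310', '621320', '621330')
data <- data.frame(employees = sample.int(1000, 12),
                   naics = sample(codes, 12, replace = TRUE),
                   stringsAsFactors = FALSE)

# Hypothetical, shortened stand-ins for top8 / top8_desc.
top_codes  <- c('621111', '621210', '621310')
top_labels <- c('Offices of physicians',
                'Offices of dentists',
                'Offices of chiropractors')

# Map every code outside top_codes to one placeholder value ...
naics_chr <- as.character(data$naics)
naics_chr[!(naics_chr %in% top_codes)] <- 'other'

# ... then declare the factor, pairing each level with its label.
data$naics_f <- factor(naics_chr,
                       levels = c(top_codes, 'other'),
                       labels = c(top_labels, 'Other'))

# Counts per level, with everything outside top_codes under "Other".
table(data$naics_f)
```

Because `levels` and `labels` are matched by position, the codes and descriptions must be listed in the same order, as they are in the question's two vectors.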