
collapsing two factors into one?


Possible Duplicate:
Joining factor levels of two columns in R

I'm fairly new to R, and I'm trying to make my recoding script somewhat more effective and "correct". I've tried searching the forums but that got me nowhere - perhaps I'm using the wrong terminology and missed it, so please bear with me if the question has already been put up.

I have two factor variables that I wish to collapse into one. They stem from the same survey and both measure educational level. The reason I have two variables in the first place is an unfortunate survey construction, but that's beside the point. The main point is that they are mutually exclusive (a respondent can only have a value in one of them).

My data looks like this:

education       education2
9th grade       <NA>
9th grade       <NA>
<NA>            9th grade
<NA>            10th grade
10th grade      <NA>
11th grade      <NA>
<NA>            9th grade
<NA>            11th grade
<NA>            <NA>

and my script looks like this:

highest.edu     <- vector(length=length(df$education))
a.grade       <- which(df$education=="9th grade")
a.grade2      <- which(df$education2=="9th grade")
b.grade      <- which(df$education=="10th grade")
b.grade2     <- which(df$education2=="10th grade")
c.grade      <- which(df$education=="11th grade")
c.grade2     <- which(df$education2=="11th grade")

highest.edu[a.grade]      <- as.character(df$education)[a.grade]
highest.edu[a.grade2]     <- as.character(df$education2)[a.grade2]
highest.edu[b.grade]     <- as.character(df$education)[b.grade]
highest.edu[b.grade2]    <- as.character(df$education2)[b.grade2]
highest.edu[c.grade]     <- as.character(df$education)[c.grade]
highest.edu[c.grade2]    <- as.character(df$education2)[c.grade2]

highest.edu  <- factor(highest.edu)
highest.edu[highest.edu =="FALSE"] =NA
highest.edu  <- factor(highest.edu)

Of course this is not bad, but when you have to do it several times for pairs of factor variables with 15 levels each, you start looking for quicker alternatives.

I've tried something like this but without any luck:

a.grade   <- which(df$education=="9th grade" | df$education2=="9th grade")
b.grade  <- which(df$education=="10th grade" | df$education2=="10th grade")
c.grade  <- which(df$education=="11th grade" | df$education2=="11th grade")

highest.edu[a.grade]      <- as.character(df$education)[a.grade] | as.character(df$education2)[a.grade]
highest.edu[b.grade]      <- as.character(df$education)[b.grade] | as.character(df$education2)[b.grade]

giving me this: Error in as.character(df$education)[9th grade] | as.character(df$education2)[9th grade]: operations are possible only for numeric, logical or complex types

Is there a way to overcome this?

Thanks for any suggestions in advance

edit:

the result I'm aiming at is this:

highest.education
9th grade
9th grade
9th grade
10th grade
10th grade
11th grade
9th grade
11th grade
<NA>

the post: 'Joining factor levels of two columns in R' seems to be going for another result

again, thank you


Solution

  • Once they're character strings it's easy

    # make them character types
    ed <- levels(df$education)[df$education]
    ed2 <- levels(df$education2)[df$education2]
    # make one new factor that integrates them
    ed[is.na(ed)] <- ed2[is.na(ed)]
    # make it a factor again
    ed <- factor(ed)
    

    You could speed this up by reading the columns in as character vectors in the first place, which is easy if you already set the column types (colClasses) in read.table.
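For reference, here is a minimal, self-contained sketch of the approach applied to the sample data from the question. The data frame is rebuilt by hand, the column name highest.education is taken from the desired output, and the explicit level order and the `;`-separated sample text are assumptions made for illustration. The second part shows one way the columns might be read in as character from the start.

```r
# Sample data from the question, with both columns stored as factors.
df <- data.frame(
  education  = factor(c("9th grade", "9th grade", NA, NA, "10th grade",
                        "11th grade", NA, NA, NA)),
  education2 = factor(c(NA, NA, "9th grade", "10th grade", NA,
                        NA, "9th grade", "11th grade", NA))
)

# Convert both factors to character vectors.
ed  <- as.character(df$education)
ed2 <- as.character(df$education2)

# Wherever the first column is missing, take the value from the second.
ed[is.na(ed)] <- ed2[is.na(ed)]

# Back to a factor. The explicit levels are an assumption: without them the
# levels are sorted alphabetically ("10th grade", "11th grade", "9th grade").
df$highest.education <- factor(
  ed, levels = c("9th grade", "10th grade", "11th grade")
)

# Variant: read the columns as character in the first place.
# The inline text stands in for a real file (hypothetical sample).
raw <- "education;education2
9th grade;NA
NA;10th grade
NA;NA"
chr <- read.table(text = raw, sep = ";", header = TRUE,
                  colClasses = "character", na.strings = "NA")
chr$highest.education <- factor(
  ifelse(is.na(chr$education), chr$education2, chr$education)
)
```

Rows where both columns are missing stay NA, and because the two variables are mutually exclusive, no value from education2 is ever discarded.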