Search code examples
rdplyrwhile-loopcolumnsorting

R- compare different columns of a data frame with different values


I am currently working on microdata, using a survey called SHARE. I want to use a variable for education but the way it was coded makes it kind of hard.

In the survey, households are asked what degree they have. There is one column for each degree and it takes the value 0 or 1 if the interviewed has the degree or not. The issue is that I have two countries with different degrees, but they are using the same column, so I have to go to the user manual to find for each country to which degree corresponds each 0 or 1. I was able to do so and then translate it to an international way of measuring education.

My idea was to sum each column and then having only one column for each household. However, I wasn't able to proceed because some people have many degrees. I would like to get the highest degree of each household. I would like to have your help on this issue.

Here are tables of what I have and what I would like:

Let imagine in Germany the first diplome is equivalent to the first diplome in international standards, the second and thee third in Germany are the same as the second diplom in international standards and the last diplom in Germany is the same as the third internationally. And in France we have first = first int., second = second int., third = third int. and no fourth diplom. Then I have a the table:

country= c( "Germany", "Germany", "Germany", "France" , "France", "France")
degree_one= c( 1, 1, 1, 1 , 1, 1)
degree_two = c( 0, 1, 0, 1 , 1, 0)
degree_three= c( 1, 0, 1, 1 , 1, 0)
degree_four = c( 1, 0, 0, NA ,NA,  NA)

f = data.frame(country,degree_one,degree_two,degree_three,degree_four)

Then I can translate and try to creat my variable degree by summing everything:

f$degree_one = ifelse(f$country == "Germany" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "Germany" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "Germany" & f$degree_three == 1,2,f$degree_three)
f$degree_four = ifelse(f$country == "Germany" & f$degree_four == 1,3,f$degree_four)

f$degree_one = ifelse(f$country == "France" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "France" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "France" & f$degree_three == 1,3,f$degree_three)
f$degree_four = ifelse(f$country == "France" & f$degree_four == "NA",0,f$degree_four)

f = replace(f, is.na(f), 0)

f2 = f %>% mutate(degree = degree_one + degree_two + degree_three + degree_four )

Unfortunately, it does not work and what I would like should look like this:

degree = c(3,2,2,3,3,1)
f3 = data.frame(f,degree)

I tried to do smth with a while loop but it did not work, as anyone any idea how I can solve my problem? I tried to make it as clear as possible, I hope you will understand and that someone as an idea on how to fix this.

Thanks :)


Solution

  • Here is an approach using data.table

    library(data.table)
    ##
    #  create degree map by country
    #
    degreeMap <- data.table(country=c('France', 'Germany'))
    degreeMap <- degreeMap[, .(degree=paste('degree', c('one', 'two', 'three', 'four'), sep='_')), by=.(country)]
    degreeMap[country=='France',  intlDegree:=c(1,2,3,NA)]
    degreeMap[country=='Germany', intlDegree:=c(1,2,2,3)]
    ##
    #   process your data
    #
    setDT(f)
    f[, indx:=1:.N]                     # need an index column to recover original order
    f[, HH:=1:.N, by=.(country)]        # need a  HH column to distinguish different HH w/in country
    maxDegree <- melt(f, id=c('country', 'HH', 'indx'), variable.name='degree', value.name = 'flag')
    maxDegree <- maxDegree[flag > 0]    # remove rows with flag=0 or NA
    setorder(maxDegree, HH, degree)
    maxDegree <- maxDegree[, .SD[.N], keyby=.(country, HH)]
    maxDegree[degreeMap, intlDegree:=i.intlDegree, on=.(country, degree)]
    setorder(maxDegree, indx)
    maxDegree
    ##    country HH indx       degree flag intlDegree
    ## 1: Germany  1    1  degree_four    1          3
    ## 2: Germany  2    2   degree_two    1          2
    ## 3: Germany  3    3 degree_three    1          2
    ## 4:  France  1    4 degree_three    1          3
    ## 5:  France  2    5 degree_three    1          3
    ## 6:  France  3    6   degree_one    1          1
    

    So this converts your f to a data.table and adds an index column and a HH column to distinguish between HH within a country.

    We then convert to long format using melt(...). In long format the four degree_ columns are reduced to two columns: a flag column indicating whether or not the degree applies, and a degree column indicating which degree.

    Then we remove all rows with 0 or NA flags, and then extract the last remaining row (highest degree) for each country and HH.

    Finally, we join to degreeMap to get the equivalent intlDegree.