Search code examples
rmergelabellevels

How can I merge a df in R with category values with another df with its corresponding values?


EDIT: Rephrasing: The situation described can be the result of data extraction from different statistical programs, that may produce separately csv files with (a) factors levels and (b) their corresponding assigned "numerical" values.

1) I may have many factor variables e.g. gender, age ranges and these factor variables have levels e.g. male/famele, 18-30/31-40 etc. 2) These levels are assigned to some numbers, either ordered or not. 3) The factor/level dataframe is one dataframe / dataset. The assigned to factor levels dataset is a separated dataframe. 4) I would like to join these two datasets into a single one. This means that the ordered factor levels must be retained and correctly assigned to their corresponding numbers.

So Factor levels and their assigned numbers are kept into two different datasets (e.g. csv files). These two dataframes must be "merged".

How can I do that? p.s. There is a cmmon ID variable between these two datasets.

              d1_levels          d2_levels      d3_levels
1               2                   2               0
2               0                   1               2
3               1                   2               1
4               2                   2               2


              d1_labels        d2_labels      d3_labels                                 
1               boy              east               <3kg
2               dont know        south              3kg
3               girl             east               >3kg
4               boy              east               3kg

I would like to have the same result for d1_labels, as the result of the below R command

dataset$d1_labels<- factor(d1_levels, levels = c(0,1,2), labels = c("dont know", "girl", "boy"))

Solution

  • The question is not fully clear:

    • The term merge is used when we are talking about two data.frames to be combined according to some key.
    • The terms levels and labelsare used with factors

    We will try both variants and hope the OP will specify what he is after.

    Merge

    merge(DF1, DF2, by = "rn")
    #  rn d1_levels d2_levels d3_levels d1_labels d2_labels d3_labels
    #1  1         2         2         0       boy      east      <3kg
    #2  2         0         1         2 dont know     south       3kg
    #3  3         1         2         1      girl      east      >3kg
    #4  4         2         2         2       boy      east       3kg 
    

    Factor

    reorder(factor(DF2$d1_labels), DF1$d1_levels)
    #[1] boy       dont know girl      boy      
    #attr(,"scores")
    #      boy dont know      girl 
    #        2         0         1 
    #Levels: dont know girl boy
    
    reorder(factor(DF2$d2_labels), DF1$d2_levels)
    #[1] east  south east  east 
    #attr(,"scores")
    # east south 
    #    2     1 
    #Levels: south east
    
    reorder(factor(DF2$d3_labels), DF1$d3_levels)
    #[1] <3kg 3kg  >3kg 3kg 
    #attr(,"scores")
    #<3kg >3kg  3kg 
    #   0    1    2 
    #Levels: <3kg >3kg 3kg
    

    factor() creates the factors, reorder() orders the factor levels according to the sequence given in the levels columns. In R, numbering of levels start with 1.

    The single results can be combined back to one dataframe (but note that this not the preferred way in R to convert many columns manually.)

    result <- data.frame(
      rn = DT1$rn,
      d1 = reorder(factor(DF2$d1_labels), DF1$d1_levels),
      d2 = reorder(factor(DF2$d2_labels), DF1$d2_levels),
      d3 = reorder(factor(DF2$d3_labels), DF1$d3_levels)
    )
    

    Combining levels and labels for many factor columns

    The OP has clarified the question and has requested to combine levels and labels for as many as, e.g., 500 factor columns.

    Unfortunately, this is quite complex as it requires to bring together data from two different data.frames which have been named differently. It would be much easier if the matching columns in both data.frames would have been named equally, e.g., d1. So, we have to bring together d1_levels from DF1 with d1_labels from DF2.

    Get base names of columns

    base_names <- na.omit(unique(stringr::str_extract(c(names(DF1), names(DF2)), ".+(?=_levels$)")))
    
    base_names
    #[1] "d1" "d2" "d3"
    

    Create new data.frame

    result <- as.data.frame(
      setNames(
        lapply(base_names, function(x) {
          reorder(factor(DF2[[paste0(x, "_labels")]]), DF1[[paste0(x, "_levels")]])
        }), base_names
      )
    )
    
    result
    #         d1    d2   d3
    #1       boy  east <3kg
    #2 dont know south  3kg
    #3      girl  east >3kg
    #4       boy  east  3kg
    
    str(result)
    #'data.frame':  4 obs. of  3 variables:
    # $ d1: Factor w/ 3 levels "dont know","girl",..: 3 1 2 3
    #  ..- attr(*, "scores")= num [1:3(1d)] 2 0 1
    #  .. ..- attr(*, "dimnames")=List of 1
    #  .. .. ..$ : chr  "boy" "dont know" "girl"
    # $ d2: Factor w/ 2 levels "south","east": 2 1 2 2
    #  ..- attr(*, "scores")= num [1:2(1d)] 2 1
    #  .. ..- attr(*, "dimnames")=List of 1
    #  .. .. ..$ : chr  "east" "south"
    # $ d3: Factor w/ 3 levels "<3kg",">3kg",..: 1 3 2 3
    #  ..- attr(*, "scores")= num [1:3(1d)] 0 1 2
    #  .. ..- attr(*, "dimnames")=List of 1
    #  .. .. ..$ : chr  "<3kg" ">3kg" "3kg"