Search code examples
rif-statementdplyrlapplyr-factor

Convert numeric columns to factors with different labels using key


I have a data set that has 900 columns of numeric data and I need to convert the numeric columns to factors that have labels. Many labels will repeat. I am trying to write a function that will take the numeric columns, identify the type of label that the column needs, and then apply that label.

Here is an example data frame:

#create data frame with columns a,b,c,d
a<-c(1,2,3,4,5)
b<-c(0,1,0,1,0)
c<-c(1,0,1,0,1)
d<-c(2,3,4,5,3)

x<-as.data.frame(cbind(a,b,c,d))

I have a separate dataframe (i.e. y) that includes a key (i.e. column e) that identifies which factor labels should be applied to which of the columns (i.e. column f). Notice that b and c should have the same label.

e<-c(1,2,2,3)
f<-c("a","b","c","d")

y<-as.data.frame(cbind(e,f))

I would like to write a function that does the following, but automated. Here are the example labels that I would like to apply to a,b,c,d--where a and d are different, but b and c are the same.

x$a<-factor(x$a,
    levels=c(1,2,3,4,5),
    labels=c("Less than 25%",
    "25-50%",
    "51-75%",
    "76-90%",
    "More than 90%"))

x$b<-factor(x$b,
    levels=c(0,1),
    labels=c("Yes","No"))

x$c<-factor(x$c,
    levels=c(0,1),
    labels=c("Yes","No"))

x$d<-factor(x$c,
    levels=c(1,2,3,4,5),
    labels=c("l","m","n","o","p"))

With the final data set looking like:

>x
              a   b   c d
1 Less than 25% Yes  No m
2        25-50%  No Yes n
3        51-75% Yes  No o
4        76-90%  No Yes p
5 More than 90% Yes  No n

In the actual data set, there will be close to 60 labels.


Solution

  • If you can get the labels and levels appropriately associated with the e column, which is your link to the columns of the dataset, you can do this via purrr:pmap_df.

    Here's how that would look. Most of the work is in getting the labels and levels as a list column, which I do via tibble (loaded with dplyr).

    Starting with your second dataset, y, which is an important part of this.

    e = c(1,2,2,3)
    f =  names(x)
    
    y = data.frame(e,f)
    
      e f
    1 1 a
    2 2 b
    3 2 c
    4 3 d
    

    Make sure the levels and labels are available and can be associated with your e vector. If they are in a long format, you could get them into a list-column format via tidyr::nest. I found this to be the most time-consuming step in terms of getting this info written out.

    library(dplyr)
    
    levels.labels = tibble(e = c(1, 2, 3),
                 levels = list(1:5, 0:1, 1:5),
                 labels = list(c("Less than 25%",
                                 "25-50%",
                                 "51-75%",
                                 "76-90%",
                                 "More than 90%"),
                               c("Yes","No"),
                               c("l","m","n","o","p")))
    

    If you needed to write your levels and labels out within R you might want to try tribble, which is available in the development version of the tibble package.

    library(tibble)
    levels.labels  = tribble(~e, ~levels, ~labels,
          1, 1:5, c("Less than 25%",
                   "25-50%",
                   "51-75%",
                   "76-90%",
                   "More than 90%"),
          2, 0:1, c("Yes","No"),
          3, 1:5, c("l","m","n","o","p"))
    

    Merge the levels and labels with your y dataset based on e. The rows of the result is a 1 to 1 match of the columns of x.

    key = left_join(y, levels.labels)
    
      e f        levels                                               labels
    1 1 a 1, 2, 3, 4, 5 Less than 25%, 25-50%, 51-75%, 76-90%, More than 90%
    2 2 b          0, 1                                              Yes, No
    3 2 c          0, 1                                              Yes, No
    4 3 d 1, 2, 3, 4, 5                                        l, m, n, o, p
    

    To factor each column, put the x dataset, the levels, and the labels all into a named list. The names of each element correspond to the names of the arguments you need to use from factor. This allows you to easily use pmap_df from purrr to factor each column of x, using the known levels and labels information.

    library(purrr)
    pmap_df(list(x = x, levels = key$levels, labels = key$labels), factor)
    
    # A tibble: 5 x 4
                  a      b      c      d
             <fctr> <fctr> <fctr> <fctr>
    1 Less than 25%    Yes     No      m
    2        25-50%     No    Yes      n
    3        51-75%    Yes     No      o
    4        76-90%     No    Yes      p
    5 More than 90%    Yes     No      n
    

    In pmap functions, the elements within the list must be all the same size. In this case, the first element has 4 columns and the second two are vectors with length 4.