
In a list of data frames, pad one variable with leading zeros (ideally w/ stringr)


I'm working with a list of data frames. In each data frame, I would like to pad a single ID variable with leading zeros. The ID variables are character vectors and are always the first variable in the data frame. In each data frame, however, the ID variable has a different length. For example:

df1_id ranges from 1:20, thus I need to pad with up to one zero, df2_id ranges from 1:100, thus I need to pad with up to two zeros, etc.

My question is: how can I pad each data frame without having to write a separate line of code for each data frame in the list?

As mentioned above, I can solve this problem by using the str_pad function on each data frame separately. For example, see the code below:

#Load stringr package
library(stringr)

#Create sample data frames
df1 <- data.frame("x" = as.character(1:20), "y" = rnorm(20, 10, 1), 
stringsAsFactors = FALSE)

df2 <- data.frame("v" = as.character(1:100), "y" = rnorm(100, 10, 1), 
stringsAsFactors = FALSE)

df3 <- data.frame("z" = as.character(1:1000), "y" = rnorm(1000, 10, 1), 
stringsAsFactors = FALSE)

#Combine data frames into list
dfl <- list(df1, df2, df3)

#Pad ID variables with leading zeros
dfl[[1]]$x <- str_pad(dfl[[1]]$x, width = 2, pad = "0")
dfl[[2]]$v <- str_pad(dfl[[2]]$v, width = 3, pad = "0")
dfl[[3]]$z <- str_pad(dfl[[3]]$z, width = 4, pad = "0")

While this solution works relatively well for a short list, as the number of data frames increases, it becomes a bit unwieldy.

I would love if there was a way that I could embed some sort of "sequence" vector into the width argument of the str_pad function. Something like this:

dfl <- lapply(dfl, function(x) {x[,1] <- str_pad(x[,1], width = SEQ, pad = 
"0")})

where SEQ is a vector of variable lengths. Using the above example it would look something like:

seq <- c(2,3,4)

Thanks in advance, and please let me know if you have any questions.

~kj


Solution

  • You could use Map here, which is designed to apply a function "to the first elements of each ... argument, the second elements, the third elements", see ?mapply for details.

    library(stringr)
    vec <- c(2,3,4) # this is the vector of 'widths', don't name it seq
    
    Map(function(i, y) {
      dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
      dfl[[i]] # this gets returned
    }, 
    # you iterate over these two vectors in parallel
    i = 1:length(dfl), 
    y = vec) 
    

    Output

    #[[1]]
    #   x         y
    #1 01  9.373546
    #2 02 10.183643
    #3 03  9.164371
    #
    #[[2]]
    #    v         y
    #1 001 11.595281
    #2 002 10.329508
    #3 003  9.179532
    #4 004 10.487429
    #
    #[[3]]
    #     z         y
    #1 0001 10.738325
    #2 0002 10.575781
    #3 0003  9.694612
    #4 0004 11.511781
    #5 0005 10.389843
    

    Explanation

    The function that we pass to Map is an anonymous function, which more or less you provided in your question:

    function(i, y) {
      dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
      dfl[[i]] # this gets returned
    }
    

    The function takes two arguments, i and y (choose other names if you like, such as idx and width). Here i is the position of a dataframe in your list and y is the width to pad to. For each dataframe in your list, the function modifies the first column, dfl[[i]][, 1] <- ..., by applying str_pad to it:

    ... <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
    

    Note that we don't pass a fixed value to the width argument but y, which takes a different value for each dataframe.

    Coming back to Map. Map now applies str_pad to the first dataframe, with argument width = 2, it applies str_pad to the second dataframe, with argument width = 3 and - you probably guessed it - it applies str_pad to the third dataframe in your list, with argument width = 4.

    The arguments are specified in the last two lines of the code as

    i = 1:length(dfl), 
    y = vec) 
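
    As a variation on the code above, here is a minimal sketch that passes the data frames themselves to Map instead of their positions, so the function does not need to reach for dfl from inside. The stand-in data and the names df, w, widths and padded are assumptions made for this illustration, not part of the original answer.

    ```r
    library(stringr)

    # Small stand-in data (hypothetical, same shape as the example data)
    dfl <- list(
      data.frame(x = as.character(1:3), y = 1:3, stringsAsFactors = FALSE),
      data.frame(v = as.character(1:4), y = 1:4, stringsAsFactors = FALSE)
    )
    widths <- c(2, 3)  # one target width per data frame

    # Map hands the function one data frame and one width at a time
    padded <- Map(function(df, w) {
      df[[1]] <- str_pad(df[[1]], width = w, pad = "0")  # pad the first column
      df  # return the modified copy
    }, dfl, widths)

    padded[[2]]$v
    # [1] "001" "002" "003" "004"
    ```

    Either form should give the same result; this one simply avoids indexing into dfl by hand.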
    

    I hope this helps.


    Data

    (Consider creating a minimal example next time, as the number of rows in the dataframes is not relevant to the problem.)

    set.seed(1)
    df1 <- data.frame("x" = as.character(1:3), "y" = rnorm(3, 10, 1), 
                      stringsAsFactors = FALSE)
    
    df2 <- data.frame("v" = as.character(1:4), "y" = rnorm(4, 10, 1), 
                      stringsAsFactors = FALSE)
    
    df3 <- data.frame("z" = as.character(1:5), "y" = rnorm(5, 10, 1), 
                      stringsAsFactors = FALSE)
    
    #Combine data frames into list
    dfl <- list(df1, df2, df3)
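
    If the widths should follow from the data rather than be typed by hand, one possible sketch is to take the longest ID in each data frame as the target width. This assumes that is the width you want, which matches the question's description (IDs 1:20 need width 2, IDs 1:100 need width 3). The data and the names df, w and padded are made up for this illustration.

    ```r
    library(stringr)

    # Hypothetical data with IDs of different maximum lengths
    dfl <- list(
      data.frame(x = as.character(1:20), y = 1:20, stringsAsFactors = FALSE),
      data.frame(v = as.character(1:100), y = 1:100, stringsAsFactors = FALSE)
    )

    padded <- lapply(dfl, function(df) {
      w <- max(nchar(df[[1]]))  # assumed target width: the longest ID
      df[[1]] <- str_pad(df[[1]], width = w, pad = "0")
      df  # return the modified copy
    })

    head(padded[[1]]$x, 3)
    # [1] "01" "02" "03"
    padded[[2]]$v[c(1, 10, 100)]
    # [1] "001" "010" "100"
    ```

    With this approach no separate vector of widths is needed, so plain lapply is enough.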