Add unique identifier column to a nested list of data frames


Example Data

I am working with a nested list of data frames. My list contains over 1,000 lists, each containing a single data frame as its only element. Each data frame contains 30+ observations of 10+ variables. To keep things simple, here is a small example list:

df1 <- tibble::tibble(a = 1:25, b = 1:25, c = 1:25, d = 1:25, e = 1:25)
df2 <- tibble::tibble(a = 1:35, b = 1:35, c = 1:35, d = 1:35, e = 1:35)
df3 <- tibble::tibble(a = 1:30, b = 1:30, c = 1:30, d = 1:30, e = 1:30)
df4 <- tibble::tibble(a = 1:20, b = 1:20, c = 1:20, d = 1:20, e = 1:20)

dfs_list <- list(list(df1), list(df2), list(df3), list(df4))
dfs_list

[[1]]
[[1]][[1]]
# A tibble: 25 x 5
       a     b     c     d     e
   <int> <int> <int> <int> <int>
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
# ... with 15 more rows


[[2]]
[[2]][[1]]
# A tibble: 35 x 5
       a     b     c     d     e
   <int> <int> <int> <int> <int>
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
# ... with 25 more rows


[[3]]
[[3]][[1]]
# A tibble: 30 x 5
       a     b     c     d     e
   <int> <int> <int> <int> <int>
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
# ... with 20 more rows


[[4]]
[[4]][[1]]
# A tibble: 20 x 5
       a     b     c     d     e
   <int> <int> <int> <int> <int>
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
# ... with 10 more rows

Desired Output

I'm trying to generate a column containing a unique identifier for each data frame in my list. The column would be based on two sequences of numbers; say, 1:10 and 1:100. For example, the column in the first data frame would contain 1.1, the second would contain 2.1, and so on, all the way up to 10.100.

Working off the smaller example from the beginning, let's make my number sequences 1:2 and 1:2. The identifier column below is what I'm looking to add to each data frame in my list:

[[1]]
[[1]][[1]]
# A tibble: 25 x 6
       a     b     c     d     e identifier
   <int> <int> <int> <int> <int> <chr>     
 1     1     1     1     1     1 1.1       
 2     2     2     2     2     2 1.1       
 3     3     3     3     3     3 1.1       
 4     4     4     4     4     4 1.1       
 5     5     5     5     5     5 1.1       
 6     6     6     6     6     6 1.1       
 7     7     7     7     7     7 1.1       
 8     8     8     8     8     8 1.1       
 9     9     9     9     9     9 1.1       
10    10    10    10    10    10 1.1       
# ... with 15 more rows


[[2]]
[[2]][[1]]
# A tibble: 35 x 6
       a     b     c     d     e identifier
   <int> <int> <int> <int> <int> <chr>     
 1     1     1     1     1     1 2.1      
 2     2     2     2     2     2 2.1       
 3     3     3     3     3     3 2.1       
 4     4     4     4     4     4 2.1       
 5     5     5     5     5     5 2.1       
 6     6     6     6     6     6 2.1       
 7     7     7     7     7     7 2.1       
 8     8     8     8     8     8 2.1       
 9     9     9     9     9     9 2.1       
10    10    10    10    10    10 2.1       
# ... with 25 more rows


[[3]]
[[3]][[1]]
# A tibble: 30 x 6
       a     b     c     d     e identifier
   <int> <int> <int> <int> <int> <chr>     
 1     1     1     1     1     1 1.2       
 2     2     2     2     2     2 1.2       
 3     3     3     3     3     3 1.2       
 4     4     4     4     4     4 1.2       
 5     5     5     5     5     5 1.2       
 6     6     6     6     6     6 1.2       
 7     7     7     7     7     7 1.2       
 8     8     8     8     8     8 1.2       
 9     9     9     9     9     9 1.2       
10    10    10    10    10    10 1.2       
# ... with 20 more rows


[[4]]
[[4]][[1]]
# A tibble: 20 x 6
       a     b     c     d     e identifier
   <int> <int> <int> <int> <int> <chr>     
 1     1     1     1     1     1 2.2       
 2     2     2     2     2     2 2.2       
 3     3     3     3     3     3 2.2       
 4     4     4     4     4     4 2.2       
 5     5     5     5     5     5 2.2       
 6     6     6     6     6     6 2.2       
 7     7     7     7     7     7 2.2       
 8     8     8     8     8     8 2.2       
 9     9     9     9     9     9 2.2       
10    10    10    10    10    10 2.2       
# ... with 10 more rows
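
The identifier order shown above (1.1, 2.1, 1.2, 2.2) matches what expand.grid() produces, because its first argument varies fastest. The sketch below builds the identifiers that way; the column names first and second are made up for illustration. It also shows why the identifiers are best kept as character strings: as numbers, values such as "1.10" and "1.1" would no longer be distinct.

```r
# Identifiers for the small example: sequences 1:2 and 1:2.
grid <- expand.grid(first = 1:2, second = 1:2)   # 'first' varies fastest
ids <- paste(grid$first, grid$second, sep = ".")
ids

# The full-size case from the question: 1:10 crossed with 1:100.
big <- expand.grid(first = 1:10, second = 1:100)
big_ids <- paste(big$first, big$second, sep = ".")

length(unique(big_ids))               # 1000 distinct strings
length(unique(as.numeric(big_ids)))   # fewer distinct values once converted to numbers
```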

Attempted Methods

I've tried creating a character vector of identifiers using apply(expand.grid()), then binding one element of that vector to each data frame using mapply():

a.b <- apply(expand.grid(c(1:2), c(1:2)), 1, paste, collapse = '.')
mapply(cbind, dfs_list, "Identifier" = a.b, SIMPLIFY = F)

However, the column is bound to each parent list rather than to the data frame inside it:

[[1]]
               Identifier    
[1,] tbl_df,5 "1.1"

[[2]]
               Identifier    
[1,] tbl_df,5 "2.1"

[[3]]
               Identifier    
[1,] tbl_df,5 "1.2"

[[4]]
               Identifier    
[1,] tbl_df,5 "2.2"
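
One way to see what is going on: mapply() hands cbind() each top-level element of the list, and that element is itself a one-element list, not a data frame. The sketch below reproduces this on a tiny stand-in for dfs_list, using base data frames instead of tibbles so that it runs without extra packages (an assumption made for brevity, not part of the original data).

```r
# Tiny stand-in for the nested list (base data frames instead of tibbles).
dfs_list <- list(list(data.frame(a = 1:3)), list(data.frame(a = 1:2)))
a.b <- c("1.1", "2.1")

# Each top-level element is a plain list wrapping the data frame ...
class(dfs_list[[1]])        # "list"
class(dfs_list[[1]][[1]])   # "data.frame"

# ... so cbind() receives a list and builds a 1 x 2 list-matrix:
# one cell holds the whole data frame, the other holds the identifier.
wrong <- mapply(cbind, dfs_list, "Identifier" = a.b, SIMPLIFY = FALSE)
dim(wrong[[1]])
```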

After some trial and error, I attempted a slightly different approach later in the evening. At first I thought I'd solved my problem, but the list generated was 13 GB as opposed to 19 MB before, and took (relatively) much, much longer to write, so I doubt this is the solution. I'm also unable to reproduce my results using my example data set this morning.

> dfs_identify <- dfs_list %>% 
+   apply(function(z) mapply(cbind, z, "Identifier" = a.b, SIMPLIFY = F))
Error in match.fun(FUN) : argument "FUN" is missing, with no default
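
The error itself comes from apply(), which expects an array followed by a MARGIN argument; the function ends up in the MARGIN slot, so FUN is missing. Swapping in lapply() makes the call run, but the inner mapply() then recycles the single data frame in each sub-list against every identifier. That is a plausible explanation for the jump in size, though it is only a guess about what was actually run the night before. A minimal sketch on stand-in data:

```r
# Tiny stand-in for the nested list (base data frames instead of tibbles).
dfs_list <- list(list(data.frame(a = 1:3)), list(data.frame(a = 1:2)))
a.b <- c("1.1", "2.1")

# lapply() in place of apply(): the call runs, but inside each sub-list
# mapply() recycles the one data frame against ALL identifiers.
blown_up <- lapply(dfs_list, function(z) {
  mapply(cbind, z, "Identifier" = a.b, SIMPLIFY = FALSE)
})

# Each sub-list now holds length(a.b) copies of its data frame.
lengths(blown_up)
```

With 1,000 sub-lists and 1,000 identifiers, this pattern would store roughly 1,000 copies of every data frame.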

Solution

  • Map(`names<-`, dfs_list, a.b)
    

    This names the single data frame inside each sub-list with its identifier. It doesn't add a column called "Identifier", but I think this is what you are after.

    Edit:

    Map(function(x, y) list(cbind(x[[1]], "Identifier" = y)), dfs_list, a.b)
    

    This adds a new identifier column to each data frame. x[[1]] reaches inside the nested structure, and list() re-creates the original nesting. Map(f, ...) is equivalent to mapply(f, ..., SIMPLIFY = FALSE).
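
Both variants can be checked on a tiny stand-in list. The sketch below uses base data frames rather than tibbles so that it runs without extra packages; that substitution, and the final stacking step, are illustrative additions rather than part of the answer. With tibbles, cbind() may hand back a plain data.frame, so it is worth checking the class of the result if that matters.

```r
# Tiny stand-in for the nested list (base data frames instead of tibbles).
dfs_list <- list(list(data.frame(a = 1:3)), list(data.frame(a = 1:2)))
a.b <- c("1.1", "2.1")

# Variant 1: name the data frame inside each sub-list.
named <- Map(`names<-`, dfs_list, a.b)
names(named[[1]])        # "1.1"
names(named[[1]][[1]])   # "a": the columns themselves are untouched

# Variant 2: add the identifier as a column, keeping the nesting.
with_id <- Map(function(x, y) list(cbind(x[[1]], "Identifier" = y)),
               dfs_list, a.b)
names(with_id[[1]][[1]])               # "a" "Identifier"
unique(with_id[[2]][[1]]$Identifier)   # "2.1"

# Optional follow-up: stack all data frames into one long data frame.
stacked <- do.call(rbind, lapply(with_id, `[[`, 1))
nrow(stacked)   # 5
```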