Search code examples
rfunctiondplyrapplyanti-join

Apply a dplyr function to one common column across 30 dataframes


I have up to 30 data frames with a common ID column. There are other colums in each df but im just showing the ID here.

Library      DF1         DF2      DF3

ID#          ID#         ID#      ....
1111         1111        1112     ....
2222         1111        3333     ....
3333         3333        3333     ....
4444         2222        4444     ....

I have to compare the ID# colum in each of theses tables to the library ID colum to make sure the Id number matches an ID number in the library.

Currently I use dplyr and do...

DF1 %>%
      anti_join(library, by = 'ID#') 

and the same for each table. It would just return any ID numbers that are not in the library, I do this same command for every data table but would like to run it for all 30 of my tables. I put all my DFs in a list but I'm not really sure how to proceed, for loop? apply? any help would be appreciated, as this pushes my boundaries of R knowledge.


Solution

  • You can iterate over a list of data.frame using purrr. Here is an example using 3 data.frame to extract ID that are not common with reference one.

    You can use any map_* fonction taht suits you best any function you want inside a map_* function

    see purrr website for more info


    library(dplyr)
    #> 
    #> Attachement du package : 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    set.seed(999)
    df_library <- data_frame(ID = sort(sample(1:12, 10)))
    df1 <- data_frame(ID = sort(sample(1:12, 10)))
    df2 <- data_frame(ID = sort(sample(1:12, 10)))
    df3 <- data_frame(ID = sort(sample(1:12, 10)))
    
    
    library(purrr)
    #> 
    #> Attachement du package : 'purrr'
    #> The following objects are masked from 'package:dplyr':
    #> 
    #>     contains, order_by
    
    list(df1 = df1, df2 = df2, df3 = df3) %>%
      map_df(~ anti_join(.x, df_library, by = "ID"), .id = "df_name")
    #> # A tibble: 4 x 2
    #>   df_name    ID
    #>     <chr> <int>
    #> 1     df1    12
    #> 2     df2    12
    #> 3     df3     3
    #> 4     df3    12