Tags: r, dplyr, file-import, gff

Adding data to a dataframe based on groups


I'm working with bioinformatic data: each row is a gene and the columns hold statistics and metadata. Several genes can come from the same organism, which is identified by the "ID" column, so I grouped the data on that variable.

data <- data %>%
  group_by(ID)

I want to add data from another file based on the ID (the grouping factor). The data comes from .gff files containing gene locations. There is one file per ID, named after that ID, so rows with ID = a should get data from "a.gff", rows with ID = b from "b.gff", and so on.

What the data looks like:

Gene ID
CelA a
CelB a
Atl b
prT a
HUl c

Is there a way to write a function that opens the file for each ID group, performs an operation, and then moves on to the next ID?

I'm quite new to R, any help is much appreciated!


Solution

  • I think the easiest way to do this is to read all the .gff files first. I'm not familiar with the format, so my example uses the .csv extension instead. The following code reads every file in the "dir" directory into a list column, then unnests it into a regular tibble that carries an ID column taken from each file name.

    After that you can left_join() the two tibbles on ID and then group by ID.

    library(tidyverse)
    
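    # Read every file in "dir" into a list column, then unnest into one tibble.
    binded <- tibble(
        file = list.files("dir"), # can remove before the join
        location = list.files("dir", full.names = TRUE), # can remove before the join
        ID = str_remove(file, "\\.csv$"), # strip the extension to recover the ID
        df = map(location, read_csv)
    ) %>% 
        unnest(df)
    
    # Join on ID explicitly so no other shared column name is used as a key.
    data %>% 
        left_join(binded, by = "ID")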
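
Since the question is about .gff files rather than .csv, here is a rough sketch of the same idea adapted to that format. It relies on GFF being a tab-separated format with nine standard columns and "#" comment lines. Everything else is an assumption made for illustration: the demo files written to a temporary directory, the helper name read_gff, and especially the idea that each gene's name is stored as Name=... in the attributes column. Joining on ID alone would pair every gene with every feature in its organism's file, so this sketch also joins on the gene name.

```r
library(tidyverse)

# Hypothetical demo files so the example runs on its own.
gff_dir <- file.path(tempdir(), "gff_demo")
dir.create(gff_dir, showWarnings = FALSE)
write_lines(c(
  "##gff-version 3",
  "contig1\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1;Name=CelA",
  "contig1\tdemo\tgene\t1200\t2000\t.\t-\t.\tID=g2;Name=CelB"
), file.path(gff_dir, "a.gff"))
write_lines(c(
  "##gff-version 3",
  "contig7\tdemo\tgene\t50\t400\t.\t+\t.\tID=g1;Name=Atl"
), file.path(gff_dir, "b.gff"))

# The nine standard GFF columns, in order.
gff_cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")

# Hypothetical helper: read one GFF file, skipping "#" comment lines.
read_gff <- function(path) {
  read_tsv(path, col_names = gff_cols, comment = "#",
           col_types = "ccciicccc")
}

files <- list.files(gff_dir, pattern = "\\.gff$", full.names = TRUE)

locations <- tibble(
  ID  = str_remove(basename(files), "\\.gff$"),  # "a.gff" -> "a"
  gff = map(files, read_gff)
) %>%
  unnest(gff) %>%
  filter(type == "gene") %>%
  # Assumption: the gene name is stored as Name=... in the attributes.
  mutate(Gene = str_match(attributes, "Name=([^;]+)")[, 2]) %>%
  select(ID, Gene, seqid, start, end, strand)

# The example table from the question.
data <- tibble(
  Gene = c("CelA", "CelB", "Atl", "prT", "HUl"),
  ID   = c("a", "a", "b", "a", "c")
)

annotated <- data %>%
  left_join(locations, by = c("ID", "Gene"))
```

Genes with no matching entry (here prT, which is absent from a.gff, and HUl, which has no c.gff) keep their row and get NA in the location columns. If your files name genes under a different attribute key, adjust the regular expression accordingly.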