Search code examples
rdataframedata-manipulationdata-cleaning

How to make a new df from subset of values from another df?


I have a df that looks like this:

    gene ID  Probe ID Chromosome     Start      Stop
    1:   H3F3A 539154271          1 226259488 226259567
    2:   H3F3A 539154249          1 226259368 226259447
    3:   H3F3A 539154244          1 226259238 226259317
    4:   H3F3A 539154241          1 226259118 226259197
    5:   H3F3A 539154231          1 226258998 226259077
    6:   H3F3A 539154255          1 226258868 226258947

The df contains 7830 rows corresponding to 40 different genes ID.

From this df, I would like to get a df with 40 rows, each row corresponding to each of the 40 genes, for which 'Start' would be the first Start value and 'Stop' would be the last Stop value of each gene (i.e for the first gene Start value would be 226259488 and Stop value would be 226258947. This new df would look like (ex for the first gene ID):

    gene ID  Probe ID Chromosome     Start      Stop
    1:   H3F3A 539154271          1 226259488 226258947

Any help would be great. Thx


Solution

  • Based on the format, it looks like a data.table, so using the data.table methods, grouped by 'gene ID', 'Chromosome', get the first of 'Start', last of 'Stop' and other columns, summarise the data

    library(data.table)
    df[, .(`Probe ID` = first(`Probe ID`), Start = first(Start),
         Stop = last(Stop)), by = .(`gene ID`, Chromosome)]