I have a df that looks like this:
gene ID Probe ID Chromosome Start Stop
1: H3F3A 539154271 1 226259488 226259567
2: H3F3A 539154249 1 226259368 226259447
3: H3F3A 539154244 1 226259238 226259317
4: H3F3A 539154241 1 226259118 226259197
5: H3F3A 539154231 1 226258998 226259077
6: H3F3A 539154255 1 226258868 226258947
The df contains 7830 rows corresponding to 40 different genes ID.
From this df, I would like to get a df with 40 rows, each row corresponding to each of the 40 genes, for which 'Start' would be the first Start value and 'Stop' would be the last Stop value of each gene (i.e for the first gene Start value would be 226259488 and Stop value would be 226258947. This new df would look like (ex for the first gene ID):
gene ID Probe ID Chromosome Start Stop
1: H3F3A 539154271 1 226259488 226258947
Any help would be great. Thx
Based on the format, it looks like a data.table
, so using the data.table
methods, grouped by 'gene ID', 'Chromosome', get the first
of 'Start', last
of 'Stop' and other columns, summarise the data
library(data.table)
df[, .(`Probe ID` = first(`Probe ID`), Start = first(Start),
Stop = last(Stop)), by = .(`gene ID`, Chromosome)]