Search code examples
rdataframeconditional-statementsdata-manipulation

Average across rows and sum across columns if condition is met in R dataframe


I have an R dataframe that looks like this:

chr  bp          instances_1    instances_2    instances_sum
1    143926410   0              1              1
1    144075771   1              0              1
1    187762696   0              2              2
1    187783844   2              0              2
2    121596288   0              1              1
2    122042325   3              0              3
2    259939985   1              0              1
2    259991389   0              1              1

What I would like to do is group by 'chr', determine if two rows are within 1e7 base-pairs ('bp') from one another, and if they are, retain the average (and round the average) and sum across all other columns that met the condition. So, the final product would look like:

chr  bp          instances_1    instances_2    instances_sum
1    144001091   1              1              2
1    187773270   2              2              4
2    121819307   3              1              4
2    259965687   1              1              2

I tried the to manipulate the following code (using tidyverse) that I used for a similar kind of task that did it over multiple columns:

df_Pruned <- df |>
  group_by(chr_snp1, chr_snp2) |>
  mutate(grp = (abs(loc_snp1 - lag(loc_snp1, default = first(loc_snp1))) < 1e7) &
           (abs(loc_snp2 - lag(loc_snp2, default = first(loc_snp2))) < 1e7)) |>
  group_by(grp, .add=TRUE)  |>
  filter(pval == min(pval)) |>
  ungroup()|>
  select(-grp)

into this by trying to do the same over one grouping variable ('chr') and by trying to average and sum at the same time:

df_Pruned <- df |>
  group_by(chr) |>
  mutate(grp = (abs(bp - lag(bp, default = first(bp))) < 1e7)) |>
  group_by(grp, .add=TRUE)  |>
  filter(bp == mean(bp) & instances_sum == sum(instances_sum)) |>
  ungroup()|>
  select(-grp)

But I can't get it to work. I think I'm close but could use some help.


Solution

  • Using cumsum with the lag condition produces your expected output:

    df |>
      mutate(grp = cumsum(abs(bp - lag(bp, default = first(bp))) > 1e7)) |>
      group_by(chr, grp)  |>
      summarise(bp = mean(bp),
                across(starts_with("instance"), sum),
                .groups = "drop")
    
    # A tibble: 4 × 6
        chr   grp         bp instances_1 instances_2 instances_sum
      <int> <int>      <dbl>       <int>       <int>         <int>
    1     1     0 144001090.           1           1             2
    2     1     1 187773270            2           2             4
    3     2     2 121819306.           3           1             4
    4     2     3 259965687            1           1             2