bind_rows() creates duplicate of each dataframe when binding them in R


Assuming this is my dataframe:

df <- data.frame(grp = c("ab -10", "ab 0", "ab 8", "ab -1",
                         "ab 6", "ab 6", "ab -10", "ab 1",
                         "ab -10", "ab 0", "ab 8", "ab -1",
                         "ab 6", "ab 6", "ab -10", "ab 1",
                         "d", "e", "e", "e"),
                 freq = c(1,0,0,1,0,1,2,0,1,0,2,2,1,1,0,1,0,2,2,1))
df
      grp freq
1  ab -10    1
2    ab 0    0
3    ab 8    0
4   ab -1    1
5    ab 6    0
6    ab 6    1
7  ab -10    2
8    ab 1    0
9  ab -10    1
10   ab 0    0
11   ab 8    2
12  ab -1    2
13   ab 6    1
14   ab 6    1
15 ab -10    0
16   ab 1    1
17      d    0
18      e    2
19      e    2
20      e    1

I want to have:

> finaldf
     grp freq
1 ab < 0    7
2 ab 0-5    1
3  ab 5+    5
4      d    0
5      e    5

This is what I tried:

df %>%
  bind_rows(df %>%
              filter(!grepl("ab", grp)),
            
            df %>%
              filter(grepl("ab", grp)) %>%
              mutate(grp = parse_number(grp)) %>%
              mutate(grp = cut(as.numeric(grp),
                                          breaks = c(-999, 0, 6, 999),
                                          labels = c("ab < 0", "ab 0-5", "ab 5+"),
                                          right = F))) %>%
              group_by(grp) %>%
              summarise(N =n())

but it seems like bind_rows() is duplicating the data frames. The bound result (before summarising) has 40 rows instead of 20:

      grp freq
1  ab -10    1
2    ab 0    0
3    ab 8    0
4   ab -1    1
5    ab 6    0
6    ab 6    1
7  ab -10    2
8    ab 1    0
9  ab -10    1
10   ab 0    0
11   ab 8    2
12  ab -1    2
13   ab 6    1
14   ab 6    1
15 ab -10    0
16   ab 1    1
17      d    0
18      e    2
19      e    2
20      e    1
21      d    0
22      e    2
23      e    2
24      e    1
25 ab < 0    1
26 ab 0-5    0
27  ab 5+    0
28 ab < 0    1
29  ab 5+    0
30  ab 5+    1
31 ab < 0    2
32 ab 0-5    0
33 ab < 0    1
34 ab 0-5    0
35  ab 5+    2
36 ab < 0    2
37  ab 5+    1
38  ab 5+    1
39 ab < 0    0
40 ab 0-5    1

I can slice() half of the rows, but I'm more interested in knowing what I am doing wrong.

Any other neat and pretty approach is also highly appreciated!


Solution

  • Here is one method: split the column in two with separate(), recode the numeric values, unite() the pieces back together, and then do a grouped sum.

    library(dplyr)
    library(tidyr)
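    df %>% 
      # split "ab -10" into grp1 = "ab" and value = -10; "d" and "e" get value = NA
      separate(grp, into = c('grp1', 'value'), sep = "(?<=ab)\\s+",
               fill = "right", convert = TRUE) %>% 
      # recode the numeric part into range labels (NA stays NA)
      mutate(value = case_when(value < 0 ~ '< 0', 
                               between(value, 0, 5) ~ '0-5',
                               value > 5 ~ '5+')) %>%
      # paste the pieces back together, dropping NA so "d" and "e" are unchanged
      unite(grp, grp1, value, na.rm = TRUE, sep = " ") %>% 
      group_by(grp) %>%
      summarise(freq = sum(freq), .groups = 'drop')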
    

    Output:

    # A tibble: 5 × 2
      grp     freq
      <chr>  <dbl>
    1 ab < 0     7
    2 ab 0-5     1
    3 ab 5+      5
    4 d          0
    5 e          5
    

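    In the OP's code, the leading df %>% needs to be removed, because both filtered datasets are already passed to bind_rows(). With df %>% in front, the pipe passes df as the first argument to bind_rows(), so the original rows are bound in again alongside the two filtered pieces and every row appears twice. The summary should also use sum(freq) rather than n() to get the totals the question asks for.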

    library(readr)
    bind_rows(df %>%
                filter(!grepl("ab", grp)),
              df %>%
                filter(grepl("ab", grp)) %>%
                mutate(grp = parse_number(grp)) %>%
                mutate(grp = cut(as.numeric(grp),
                                 breaks = c(-999, 0, 6, 999),
                                 labels = c("ab < 0", "ab 0-5", "ab 5+"),
                                 right = FALSE))) %>% 
      group_by(grp) %>%
      summarise(N = sum(freq))
    # A tibble: 5 × 2
      grp        N
      <chr>  <dbl>
    1 ab < 0     7
    2 ab 0-5     1
    3 ab 5+      5
    4 d          0
    5 e          5
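
To see the duplication in isolation, here is a minimal sketch that compares the piped call with the direct one on the question's data. The names non_ab, ab, piped and direct are made up for this illustration and do not appear in the original code.

```r
library(dplyr)

df <- data.frame(grp = c("ab -10", "ab 0", "ab 8", "ab -1",
                         "ab 6", "ab 6", "ab -10", "ab 1",
                         "ab -10", "ab 0", "ab 8", "ab -1",
                         "ab 6", "ab 6", "ab -10", "ab 1",
                         "d", "e", "e", "e"),
                 freq = c(1,0,0,1,0,1,2,0,1,0,2,2,1,1,0,1,0,2,2,1))

# the two pieces that together cover df exactly once (illustrative names)
non_ab <- df %>% filter(!grepl("ab", grp))   # the "d" and "e" rows
ab     <- df %>% filter(grepl("ab", grp))    # the "ab ..." rows

# the pipe inserts df as the first argument: bind_rows(df, non_ab, ab)
piped  <- df %>% bind_rows(non_ab, ab)

# without the leading pipe, only the two pieces are bound
direct <- bind_rows(non_ab, ab)

nrow(piped)   # 40: df itself plus both pieces
nrow(direct)  # 20: each original row exactly once
```

The 40 rows match the unexpected output shown in the question: the first 20 are df itself and the remaining 20 are the two filtered pieces.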