Search code examples
rfilterduplicatessummarize

Remove duplicates based on second column


I am trying to write a section of code that does a few things: 1) group dataset by ID 2) count the number of unique months in column data.month 3) remove all IDs that have less than 9 months 4) print distinct IDs based on the company (ie print ID twice if related to 2 companies) 5) remove duplicated ID and keep the record that has the highest data.month number.

I have the code working until 5). I cant get my code to only print the record (row) of duplicate IDs that has the highest month number.

I looked at a few examples here:

R remove duplicates based on other columns

Remove duplicates based on 2nd column condition

I can figure out how to remove duplicates, but I'm having trouble applying it to my circumstances.

This is the two codes I have tried to achieve my goal:

data.check6 <- bind %>%
group_by(bind$ABN) %>%
summarise(count = n_distinct(data.month)) %>%
filter(count>8) %>%
rrange(bind$data.month) %>%
filter(row_number() == 1)

and:

 library(tidyverse)

 data.check7 <- bind %>%
  group_by(ABN)%>%      
  filter(1 == length(unique(bind$data.month)), !duplicated(bind$data.month))

Right now, I get the error:

Error in arrange_impl(.data, dots) : incorrect size (345343) at position 1, expecting : 3749

In the end I would like to have a dataset where each ID only appears once and it is the ID record associated with the highest month (ie. column value = 12)


Solution

  • I think you're looking for something like that:

    Example data:

    > bind <- data.frame(ABN = rep(1:3, 3),
    +                    data.month = sample(1:12, 9),
    +                    other.inf = runif(9))
    > 
    > bind
      ABN data.month other.inf
    1   1         10 0.8102867
    2   2          4 0.2919716
    3   3          8 0.3391790
    4   1          2 0.3698933
    5   2          6 0.9155280
    6   3          1 0.2680165
    7   1          9 0.7541168
    8   2          7 0.2018796
    9   3         11 0.1546079
    

    Solution:

    > bind %>%
    +   group_by(ABN) %>%      
    +   filter(data.month == max(data.month))
    # A tibble: 3 x 3
    # Groups:   ABN [3]
        ABN data.month other.inf
      <int>      <int>     <dbl>
    1     1         10     0.810
    2     2          7     0.202
    3     3         11     0.155