I have a dataframe "df" as below:
V1 V2
1 b a
2 b a
3 a b
4 b a
5 a b
6 a b
7 a b
8 b a
9 a b
10 a b
11 a b
12 b a
Is there a way that I can automate the following 3 steps in R?
Step 1: R identifies that in the 12 rows of the dataframe "df" & the pattern "a b" is repeating majority of the times.
Step2: Based on the majority pattern in Step 1, R subsets a dataframe with only those rows which contain the majority pattern in Step 1.
Step3: R outputs a new sub-setted dataframe of Step2.
Is there a package to do this or a function that I can build? Any guidance will be very valuable. Thank You
With your updated Q, you may try this with data.table
:
library(data.table)
setDT(df)
cols <- c("V1", "V2")
df[, .N, by = cols][N == max(N)][, N := NULL][df, on = cols, nomatch = 0]
# V1 V2 id
#1: a b 3
#2: a b 5
#3: a b 6
#4: a b 7
#5: a b 9
#6: a b 10
#7: a b 11
Explanation
setDT(df)
coerces the data.frame to data.table
without copying.df
is joined so that all rows of df
which match the majority pattern are selected. nomatch = 0
specifies an inner join. This completes Step 2 and perhaps also Step 3 (but this is unclear to me, see comment.id
column are kept in the result without any additional effort. This would be the case for any other additional column in df
as well.Data
df <- fread("id V1 V2
1 b a
2 b a
3 a b
4 b a
5 a b
6 a b
7 a b
8 b a
9 a b
10 a b
11 a b
12 b a")
# EDIT: By request of the OP, the row number (id) should be kept
# df[, id := NULL]