Search code examples
rdataframesubsetsimilarity

Sub-setting rows/columns in a dataframe based on Similarity in R


I have a dataframe "df" as below:

       V1 V2
   1   b  a
   2   b  a
   3   a  b
   4   b  a
   5   a  b
   6   a  b
   7   a  b
   8   b  a
   9   a  b
   10  a  b
   11  a  b
   12  b  a

Is there a way that I can automate the following 3 steps in R?

Step 1: R identifies that in the 12 rows of the dataframe "df" & the pattern "a b" is repeating majority of the times.

Step2: Based on the majority pattern in Step 1, R subsets a dataframe with only those rows which contain the majority pattern in Step 1.

Step3: R outputs a new sub-setted dataframe of Step2.

Is there a package to do this or a function that I can build? Any guidance will be very valuable. Thank You


Solution

  • With your updated Q, you may try this with data.table:

    library(data.table)
    setDT(df)
    cols <- c("V1", "V2")
    df[, .N, by = cols][N == max(N)][, N := NULL][df, on = cols, nomatch = 0]
    #   V1 V2 id
    #1:  a  b  3
    #2:  a  b  5
    #3:  a  b  6
    #4:  a  b  7
    #5:  a  b  9
    #6:  a  b 10
    #7:  a  b 11
    

    Explanation

    1. setDT(df) coerces the data.frame to data.table without copying.
    2. The relevant columns are defined to save typing.
    3. The number of occurences of each combination in the relevant columns are counted.
    4. Only the combination with the highest number of occurences is kept. This completes Step 1 of the Q which asks to find the majority pattern.
      Note that in case of a tie, i.e., two or more combinations have the same maximum number of occurences, all combinations are returned. The OP hasn't specified how he wants to handle this case.
    5. The counts are removed as they are no longer needed.
    6. Finally, the original df is joined so that all rows of df which match the majority pattern are selected. nomatch = 0 specifies an inner join. This completes Step 2 and perhaps also Step 3 (but this is unclear to me, see comment.
      Note that the row numbers in the id column are kept in the result without any additional effort. This would be the case for any other additional column in dfas well.

    Data

    df <- fread("id    V1 V2
       1   b  a
       2   b  a
       3   a  b
       4   b  a
       5   a  b
       6   a  b
       7   a  b
       8   b  a
       9   a  b
       10  a  b
       11  a  b
       12  b  a")
    # EDIT: By request of the OP, the row number (id) should be kept
    # df[, id := NULL]