Search code examples
rdplyrplyr

How can I make a data frame of the unique values in an existing column?


I need to make a new data frame (col.3) using only the occurrences in a previous column (col.1) that correspond to unique values in another column (col.2) in an existing data frame.

I need this:

df1
col.1   col.2     
    1    1             
    1    3             
    1    7             
    1    7            
    2    12                
    2    14   
    2    14
    2    14

 df2
 col.3
     1
     1
     1
     2
     2 

I have tried this:

new.col <- cbind(df$col.1[unique(df$col.2)])

But it gives me a column that is both too long, and which does not include the complete set of col.1 values

I suspect that plyr has a simple solution to this, but I have not figured that (or any other solution) out.

How can I achieve my desired result? Preferably using plyr, but base is fine too.


Solution

  • We can use duplicated to create a logical index and use that to subset the rows

    df2 <- data.frame(col3. = df$col.1[!duplicated(df$col.2)])
    

    Or with subset

    subset(df, !duplicated(col.2), select = col.1)
    

    Or with dplyr, usedistinct on col.2 and then select the 'col.1'

    library(dplyr)
    df %>%
       distinct(col.2, .keep_all = TRUE) %>%
       select(col.3 = col.1)
    #  col.3
    #1     1
    #2     1
    #3     1
    #4     2
    #5     2
    

    If the duplicates are considered based on the equality between adjacent elements, then use rleid

    library(data.table)
    df %>% 
        filter(!duplicated(rleid(col.2))) %>% 
        select(col.3 = col.1)
    

    If we convert to data.table, the unique also have a by option

    library(data.table)
    unique(setDT(df), by = 'col.2')[, .(col.3 = col.1)]
    

    data

    df <- structure(list(col.1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), col.2 = c(1L, 
    3L, 7L, 7L, 12L, 14L, 14L)), class = "data.frame", row.names = c(NA, 
    -7L))