Search code examples
rdataframeunique

R Program: Have dataframe with non-unique IDs. Need to create column with unique IDs


I have a dataframe that has duplicate IDs with unique values. I need to make the IDs unique.

df <- read.csv("test.csv")

ID: A1, A1, A2, A2, A3, A3, A4, A4  
Value: 0.5, 0.9, 1.5, 0.8, 2.2, 2.4, 3.1, 0.5

I need to get this dataframe:

ID: A1_1, A1_2, A2_1, A2_2, A3_1, A3_2, A4_1, A4_2  
Value: 0.5, 0.9, 1.5, 0.8, 2.2, 2.4, 3.1, 0.5

I tried the following code which adds a column with repeating alternating _1 and _2 and concatenates to the ID:

unique <- c("_1", "_2")  
Unique.col <- matrix(rep(unique, ))  
unique_ID <- cbind(df, Unique.col)  
unique_ID$ID <- paste(unique_ID$ID, unique_ID$Unique.col)  
unique_ID

I get the following dataframe where there is a space between the A1 and _1:

ID: A1 _1, A1 _2, A2 _1, A2 _2, A3 _1, A3 _2, A4 _1, A4 _2  
Value: 0.5, 0.9, 1.5, 0.8, 2.2, 2.4, 3.1, 0.5

Is there a better way to do this or a way to get rid of the space?


Solution

  • A general dplyr/tidyr way to solve this is to leverage pivot_longer and pivot_wider together: lengthen, then group by the original column names and create unique within-group IDs, then widen back. This seems a little bit deviant from the usual spirit of pivoting, but it gets the job done!

    Sample data:

    df <- tribble(
        ~"A1", ~"A1", ~"A2", ~"A2", ~"A3", ~"A3", ~"A4", ~"A4",
        1, 2, 3, 4, 5, 6, 7, 8
    )
    

    For creating the unique within-group IDs, see this answer. For combining those IDs with the original column names, the tidyr pivoting vignette has some great examples (e.g. here). The key is to use the names_from argument to combine the original column name with the new IDs. This gets us to:

    df %>%
        # Pivot original column names to "name" column and original values to "value" column
        pivot_longer(cols=everything()) %>%
        # Create unique IDs within each original column
        group_by(name) %>%
        mutate(row_id=row_number()) %>%
        # Pivot back to the desired wider format
        pivot_wider(names_from=c(name, row_id))
    

    Output:

    # A tibble: 1 x 8
       A1_1  A1_2  A2_1  A2_2  A3_1  A3_2  A4_1  A4_2
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1     1     2     3     4     5     6     7     8
    

    We can also change the format of the new column names in the call to pivot_wider, either with names_sep (default is _) or with names_pattern (which takes a regular expression).