Search code examples
ralteryx

Delete repeated fields in an R data.frame


I'm very new on R and I need your help to do something that Until now i can't do:

I have a data frame that could have a random number of columns, I need to mantain in each column of the data frame only the unique values, but this have to be done independent of the other columns:

For example, if have the below data frame:

 Column_A   Column_B    Column_C
    A               1           A1  
    A               2           A2
    B               1           A3
    B               2           A4
    C               3           A5
    C               4           A6

The output for this, after the code must be:

Column_A    Column_B    Column_C
A               1           A1
B               2           A2
C               3           A3
                4           A4
                            A5
                            A6

I've tried with ds <- unique(ds) but it only will leave the unique relations between all the columns.

I will really apreciate any help or orientation you could gave me.

Thanks in advance.

Data

`> str(df)
'data.frame':   6 obs. of  3 variables:
 $ A: chr  "A" "B" "C" "A" ...
 $ B: num  1 2 1 2 3 4
 $ C: chr  "A1" "A2" "A3" "A4" ...`

Loop

`i <- 1`
`while (i < ncol(df)){
+  df[i] <-  lapply(df, function(x) {
+           x[duplicated(x)] <- ''
+           c(x[x!=''], x[x==''])})
+ i <- i+1
+}`

Solution

  • If there are 'factor' columns, it is better to convert it to character or include '' as one of the levels of the factor column. Here, I am changing the factor columns to character first.

     indx <- sapply(df1, is.factor)
     df1[indx] <- lapply(df1[indx], as.character) 
    

    Loop the columns with lapply, replace the duplicated elements with '', arrange the elements so that the empty strings will be at the end (c(x[x=''],x=='']))

     df1[] <-  lapply(df1, function(x) {
               x[duplicated(x)] <- ''
               c(x[x!=''], x[x==''])})
     df1
     #   Column_A Column_B Column_C
     #1        A        1       A1
     #2        B        2       A2
     #3        C        3       A3
     #4                 4       A4
     #5                         A5
     #6                         A6
    

    Or another option would be to use match

    df1[] <- lapply(df1, function(x) c(x[match(unique(x),x)],
                   rep('', length(x)-length(unique(x)))))
    

    NOTE: Using '' will change the numeric column classes to 'character/factor' class. It may be better to replace with NA which can be easily deleted also with custom functions is.na/na.omit/complete.cases etc..

    data

     df1 <- structure(list(Column_A = structure(c(1L, 1L, 2L, 2L, 3L, 3L), 
     .Label = c("A", 
     "B", "C"), class = "factor"), Column_B = c(1L, 2L, 1L, 2L, 3L, 
     4L), Column_C = structure(1:6, .Label = c("A1", "A2", "A3", "A4", 
     "A5", "A6"), class = "factor")), .Names = c("Column_A", "Column_B", 
     "Column_C"), row.names = c(NA, -6L), class = "data.frame")