Search code examples
rdataframetransformationrecode

Recoding dataframe by transformation table


I have a dataframe that contains 3 columns. One of the columns (items) includes long strings (some include special characters such as semicolon ; for simplicity it is written here as Tx) that might repeat. I would like to find all the unique values of the specific column. For each unique value to create a nickname in the following format: OV1, OV2,... OVn. I would like to keep this table and to call it 'transformation table'. In the next step I would like to go to the source table and to recode each value with the value in transformation table.

Here is an example for the source table:

    items  sequenceID        eventID SIZE  
1:   T1      41595370             1    1
2:   T2      41595371             1    1
3:   T3      41595282             1    1
4:   T3      41595282             2    1
5:   T4      41595373             1    1
6:   T5      41595368             1    1
7:   T1      41595379             1    1

The result table would be:

     items     sequenceID      eventID SIZE  
1:   OV1       41595370             1    1
2:   OV2       41595371             1    1
3:   OV3       41595282             1    1
4:   OV3       41595282             2    1
5:   OV4       41595373             1    1
6:   OV5       41595368             1    1
7:   OV1       41595379             1    1

The transformation table would be:

1:   T1    OV1
2:   T2    OV2
3:   T3    OV3
4:   T4    OV4
5:   T5    OV5

The source data is in data6 (dataframe). I used the following orders:

u1 <- unique(data6$items)
u1 <- data.frame(u1)
ov <- c(paste("ov",1:nrow(u1),sep=""))

I would now to replace all the unique items column that resides in u1 with OV values and to create this as transformation table. Afterwards to replace all the values in data6$items.

Thanks a lot for your help.


Solution

  • It looks like your "item" can be seen as a factor variable which needs new labels. If so you can avoid the transformation table and recode the "item" within your source dataframe:

    data6$items = factor(data6$items, labels=paste0("OV", 1:length(unique(data6$items))))