Encode unique observations using identifier


I have a data frame in which one column holds strings, each of which is a unique identifier for a journey. A reproducible example:

df <- data.frame(tours = c("ansc123123", "ansc123123", "ansc123123", "baa3999", "baa3999", "baa3999"),
                 order = rep(c(1, 2, 3), 2))

My real data is much larger, with many more observations and unique identifiers. I would like output in the format below (but without encoding it manually), so that rows with the same tours value get the same journey number.

df$journey <- c(1, 1, 1, 2, 2, 2)

Solution

  • You can convert the column to a factor and take its integer codes.

    df$journey <- as.integer(factor(df$tours))
    
    df$journey
    #[1] 1 1 1 2 2 2
    

    Or use match and unique, which numbers the identifiers in order of first appearance.

    match(df$tours, unique(df$tours))
    

    It's also possible to use factor and get the integer values with unclass. This keeps the levels as an attribute, which makes it possible to map back to the original values.

    df$journey <- unclass(factor(df$tours))
    
    df$journey
    #[1] 1 1 1 2 2 2
    #attr(,"levels")
    #[1] "ansc123123" "baa3999"   
    
    levels(df$journey)[df$journey]
    #[1] "ansc123123" "ansc123123" "ansc123123" "baa3999"    "baa3999"   
    #[6] "baa3999"
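
One thing to keep in mind with the approaches above: factor sorts its levels, while match with unique numbers the identifiers as they first appear. In the question's data both give the same result because the identifiers already appear in sorted order. A minimal sketch of the difference, using a made-up data frame df2 whose identifiers are hypothetical and deliberately out of alphabetical order:

```r
# Hypothetical data: identifiers deliberately not in alphabetical order
df2 <- data.frame(tours = c("zeta9", "zeta9", "alpha1", "alpha1", "mid5"),
                  stringsAsFactors = FALSE)

# factor() sorts the levels, so the codes follow the sorted identifiers
by_factor <- as.integer(factor(df2$tours))

# match() against unique() numbers journeys in order of first appearance
by_match <- match(df2$tours, unique(df2$tours))

# Passing explicit levels makes factor() use first-appearance order as well
by_factor_ordered <- as.integer(factor(df2$tours, levels = unique(df2$tours)))

print(by_factor)
print(by_match)
print(by_factor_ordered)
```

If the journey numbers should follow the row order of the data, match with unique (or factor with explicit levels) is probably the safer choice. If any consistent numbering is enough, the plain factor version is fine.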