Search code examples
rfactorslevels

Overwrite levels of factor columns in one dataframe using another


I have 2 data frames with multiple factor columns. One is the base data frame and the other is the final data frame. I want to update the levels of the base data frame using the final data frame.

Consider this example:

base <- data.frame(product=c("Business Call", "Business Transactional", 
                             "Monthly Non-Compounding and Standard Non-Compounding",
                             "OCR based Call", "Offsale Call", "Offsale Savings",
                             "Offsale Transactional", "Out of Scope","Personal Call"))
base$product <- as.factor(base$product)

final <- data.frame(product=c("Business Call", "Business Transactional", 
                              "Monthly Standard Non-Compounding", "OCR based Call", 
                              "Offsale Call", "Offsale Savings","Offsale Transactional", 
                              "Out of Scope","Personal Call", "You Money")) 
final$product <- as.factor(final$product)

What I would now want is for the final data base to have the same levels as base and remove the levels which do not exist at all like "You Money". Whereas "Monthly Standard Non-Compounding" to be fuzzy matched

Eg:

levels(base$var1) <- "a" "b" "c"
levels(final$var1) <- "Aa" "Bb" "Cc"

Is there a way to overwrite the levels in base data using the final data using some kind of fuzzy match?

Like I want the final levels for both data to be the same. i.e.

levels(base$var1) <- "Aa" "Bb" "Cc"
levels(final$var1) <- "Aa" "Bb" "Cc"

Solution

  • We could build our own fuzzyMatcher.

    First, we'll need kinda vectorized agrep function,

    agrepv <- function(x, y) all(as.logical(sapply(x, agrep, y)))
    

    on which we build our fuzzyMatcher.

    fuzzyMatcher <-  function(from, to) { 
      mc <- mapply(function(y) 
        which(mapply(function(x) agrepv(y, x), Map(levels, to))), 
        Map(levels, from))
      return(Map(function(x, y) `levels<-`(x, y), base, 
                 Map(levels, from)[mc]))
    }
    

    final labels applied on base labels (note, that I've shifted columns to make it a little more sophisticated):

    base[] <- fuzzyMatcher(final1, base1)
    #    X1 X2
    # 1  Aa Xx
    # 2  Aa Xx
    # 3  Aa Yy
    # 4  Aa Yy
    # 5  Bb Yy
    # 6  Bb Zz
    # 7  Bb Zz
    # 8  Aa Xx
    # 9  Cc Xx
    # 10 Cc Zz
    

    Update

    Based on the new provided data above it'll make sense to use another vectorized agrepv2(), which, used with outer(), enables us to apply agrep on all combinations of the levels of both vectors. Hereafter colSums that equal zero give us non-matching levels and which.max the matching levels of the target data frame final. We can use these two resulting vectors on the one hand to delete unused rows of final, on the other hand to subset the desired levels of the base data frame in order to rebuild the factor column.

    # add to mimic other columns in data frame
    base$x <- seq(nrow(base))
    final$x <- seq(nrow(final))
    
    # some abbrevations for convenience
    p1 <- levels(base$product)
    p2 <- levels(final$product)
    
    # agrep
    AGREPV2 <- Vectorize(function(x, y, ...) agrep(p2[x], p1[y]))  # new vectorized agrep 
    out <- t(outer(seq(p2), seq(p1), agrepv2, max.distance=0.9))  # apply `agrepv2`
    del.col <- grep(0, colSums(apply(out, 2, lengths))) # find negative matches
    lvl <- unlist(apply(out, 2, which.max))  # find positive matches
    lvl <- as.character(p2[lvl])  # get the labels
    
    # delete "non-existing" rows and re-generate factor with new labels
    transform(final[-del.col, ], product=factor(product, labels=lvl))
    #                  product x
    # 1          Business Call 1
    # 2 Business Transactional 2
    # 4         OCR based Call 4
    # 5           Offsale Call 5
    # 6        Offsale Savings 6
    # 7  Offsale Transactional 7
    # 8           Out of Scope 8
    # 9          Personal Call 9
    

    Data

    base1 <- structure(list(X1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
    3L, 3L), .Label = c("a", "b", "c"), class = "factor"), X2 = structure(c(1L, 
    1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("x", "y", "z"
    ), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
    
    final1 <- structure(list(X1 = structure(c(1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L, 
    2L, 2L, 3L, 3L, 2L, 2L, 2L), .Label = c("Xx", "Yy", "Zz"), class = "factor"), 
        X2 = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 
        2L, 2L, 2L, 2L, 3L), .Label = c("Aa", "Bb", "Cc"), class = "factor")), row.names = c(NA, 
    -15L), class = "data.frame")