Search code examples
rfor-loopif-statementmatchingstringdist

Use if/then for loop and amatch or match to find similar values and match two dataframe columns?


I have two dataframes, one with raw data labels and one with the correct adjusted values the data needs to be matched to. The labels are numeric but can differ up to +/- 2. I am trying to figure out how to write a coded if/then loop since amatch does not work well for numerics. The goal is to have a loop where for every value in the raw data, it will check against the values in the corrected data and match to the closest match if one is present where corrected - raw is between -2 and 2. I have pasted my attempted but very much nonfunctional attempt below.

My thought is that it may be possible to use amatch to select the best/closest match when one is found, since in some cases there are several data label values close together while in others there is a shift in the number up to 2.

Is there a way to write such a code, or another better way to accomplish this? The goal is to have a corrected column matched to the raw data labels that I can then use to merge with the raw data and the additional metadata attached to the corrected labels, but for my full list of labels only about 60% match without needing this adjustment (you can see in the sample data, for example, 1910 should match to 1911 and 2056 needs to match to 2057). Because of the nature of the data, the differences are not consistent and I want this to be a function that I can apply to more than just a single instance of data so that I do not have to go through and match every data label by hand.

raw <- c(1419, 1444, 1460, 1485, 1501, 1542, 1581, 1590, 
         1606, 1622, 1647, 1663, 1688, 1704, 1743, 1791, 
         1793, 1809, 1850, 1866, 1891, 1905, 1910, 1954, 
         1956, 1976, 1996, 2012, 2028, 2041, 2053, 2056, 
         2067, 2100, 2102, 2122)

corrected <- c(1419, 1444, 1460, 1485, 1501, 1542, 1562, 
               1581, 1590, 1606, 1622, 1630, 1647, 1663, 
               1688, 1704, 1743, 1792, 1793, 1809, 1825, 
               1834, 1850, 1866, 1891, 1905, 1911, 1914, 
               1938, 1954, 1955, 1971, 1976, 1996, 2012, 
               2019, 2028, 2053, 2057, 2100, 2101, 2122)


labelmatch <- function(x, y) {data.frame(glycan=x, glycan_name=
                                            (for(i in 1:length(x)) {
                                              for(n in 1:length(y)) {
                                                if (n-i <= 2 & n-i >=-2) {
                                                  match(x, y)} else{
                                                    if (n-i >= 2 | n-i <=-2){
                                                  next}}}}))
}

labelmatch(raw, corrected)

Solution

  • Since your corrected data is sorted, we can use that fact to quickly search through the vector. Inspired by np.searchsorted

    searchsorted <- function(findIn, vec, isSorted = TRUE){
      if(!isSorted) findIn <- sort(findIn)
      idx <- rank(c(vec, findIn, -Inf),, 'first')[seq_along(vec)] - rank(vec)
      right_vals <- findIn[idx]
      left_vals <- findIn[(idx2<-idx - 1) + !idx2]
      right_vals[na_idx] <- left_vals[na_idx<- is.na(right_vals)]
      right_vals[idx2] <- left_vals[idx2<- abs(right_vals - vec) > abs(left_vals - vec)]
      is.na(right_vals) <- abs(right_vals - vec) > 2
      right_vals
    }
    searchsorted(corrected, raw)
    [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
    [16] 1792 1793 1809 1850 1866 1891 1905 1911 1954 1954 1976 1996 2012 2028   NA
    [31] 2053 2057   NA 2100 2100 2122
    

    --

    Edit:

    R does provide the function findInterval which could be used to simplify the task:

    searchsorted <- function(x, vec){
      idx <- findInterval(x, vec, all.inside = TRUE)
      vals <- vec[idx]
      idx2 <- abs(vals - x) > 2
      vals2 <- vec[idx[idx2] + 1]
      is.na(vals2) <- vals2 - x[idx2] > 2
      replace(vals, idx2, vals2)
    }
    
    searchsorted(raw, corrected)
     [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704
    [15] 1743 1792 1793 1809 1850 1866 1891 1905 1911 1954 1955 1976 1996 2012
    [29] 2028   NA 2053 2057   NA 2100 2101 2122