
String comparison in R not returning correct results for data


Some time ago I asked a question here (this question) and it was correctly answered. Basically, I needed to copy the value from one specific column into a new column, depending on a sister column.

I tried using the same logic to get different values in a new case with the same data. The problem now seems to be that either R or the function fails to recognize valid values in the data frame when doing the comparison.

The function in question is as follows:

    Obtain_SD <- function(df,dfx,atr,country){

      df <- dplyr::left_join(df,dfx,by=c("cd85"="cd")) # dfx has the DAR and DAT columns

      DAR_cols <- grep("DAR",colnames(df))
      DAT_cols <- grep("DAT",colnames(df))

      df$ex90 <- df[DAT_cols][cbind(1:nrow(df),max.col(df[DAR_cols] == "90"))]
      return(df)
    }

According to this line:

    df$ex90 <- df[DAT_cols][cbind(1:nrow(df),max.col(df[DAR_cols] == "90"))]

the function should find the DAR column that holds the value "90" in each row and store the value of the corresponding DAT column in a new column, ex90. This works fine in most cases, but then this happens:

    Browse[2]> df[422,"ex90"]
    [1] NA

If I run some checks, I get the following:

    Browse[2]> typeof(df[422,"DAR04"])
    [1] "character"
    Browse[2]> df[422,"DAR04"]
    [1] "90"
    Browse[2]> df[422,"DAR04"] == "90"
    [1] TRUE

The column DAR04 is of class character and mode character (according to summary(df)), yet for this row and some others the code returns NA in ex90 (output reformatted for readability):

   ID CD    DATA DAR01 DAT01    ... DAR04 DAT04    ... DAR12 DAT12 ex90
   7  99034 ...  1     19000101 ... 90    20140715 ... NA    ""    NA

At first I thought there could be leading or trailing spaces, but that's not the case. I don't know what else to check to solve my problem. Any insight would be awesome. Thanks in advance.


Solution

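  • You are inheriting the NA via max.col(df[DAR_cols] == "90"), since you have some NAs in the DAR_cols. Comparing NA with "90" yields NA, and max.col returns NA for any row that contains an NA. E.g. DAR12 appears to be NA in the example you printed, so the whole row gets NA even though DAR04 matches.

    I am also not sure whether you would actually want to use max.col(..., ties.method = "last"). The default is ties.method = "random", which picks one of the tied columns at random when a row has several matches (or none at all).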

    You could replace max.col(df[DAR_cols] == "90") with a custom apply that handles NAs:

    unname(apply(df[DAR_cols] == "90", 1, function(x) {
      res <- which(x)
      if (length(res) == 0) res <- NA
      if (length(res) > 1) res <- max(res) # or use min(res) if you would rather have the first match
      res
    }))
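
As a rough illustration of the point above, here is a minimal sketch on toy data. The values are made up, and only the DAR/DAT column naming follows the question. It first shows how a single NA in a row makes max.col return NA, then shows one possible vectorised alternative to the apply: treat NA comparisons as "no match", take the last match, and blank out the rows that have no match at all. This is an assumption about the desired behaviour, not the only way to do it.

```r
# Toy data with hypothetical values; column names mirror the question.
df <- data.frame(
  DAR01 = c("1", "90", "1"),
  DAT01 = c("19000101", "20100101", "19000101"),
  DAR02 = c("90", "5", NA),
  DAT02 = c("20140715", "20110101", ""),
  DAR03 = c(NA, "90", NA),
  DAT03 = c("", "20120101", ""),
  stringsAsFactors = FALSE
)

DAR_cols <- grep("DAR", colnames(df))
DAT_cols <- grep("DAT", colnames(df))

# Logical matrix of matches; it is NA wherever the DAR value is NA.
hits <- df[DAR_cols] == "90"

# Rows 1 and 3 contain an NA, so max.col gives NA for them,
# even though row 1 has a genuine match in DAR02.
naive <- max.col(hits, ties.method = "last")

# Treat NA comparisons as "no match".
hits[is.na(hits)] <- FALSE

# Last matching column per row; rows without any match are set to NA,
# because max.col would otherwise still return some column for them.
idx <- max.col(hits, ties.method = "last")
idx[rowSums(hits) == 0] <- NA

# Pick the DAT value that sits at the same position as the matching DAR.
df$ex90 <- as.matrix(df[DAT_cols])[cbind(seq_len(nrow(df)), idx)]
```

With this toy data, row 1 picks up the DAT02 value, row 2 has two matches and takes the last one (DAT03), and row 3 has no match and stays NA. Switching ties.method to "first" would pick the first match instead.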