
R: filtering a data frame on several values of one column, strange behaviour


I have a data frame with several columns, and I want to get all rows where the column of interest takes one of a few given values. Initially, I was using == as in

which(df$column==c(value1, value2))

It worked for certain vectors but not all of them, and after some research I found out that %in% is the right tool for this.

However, I want to understand why == works in some cases but not in others. In particular, I want to understand why I get the following results.

test <- data.frame("true_date"=as.Date(1:365, origin="2024-01-01"))

Why

which(test$true_date==c("2024-07-02","2024-07-03"))

returns

[1] 183 184
Warning message:
In `==.default`(test$true_date, c("2024-07-02", "2024-07-03")) :
  longer object length is not a multiple of shorter object length

while

which(test$true_date==c("2024-07-03","2024-07-04"))

returns

integer(0)
Warning message:
In `==.default`(test$true_date, c("2024-07-03", "2024-07-04")) :
  longer object length is not a multiple of shorter object length

After a night of sleep, I understood why I got those messages. Thanks to anyone who can confirm my belated understanding.


Solution

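  • As hinted by the warning message (longer object length is not a multiple of shorter object length), == compares element by element, so the shorter vector is recycled (repeated) until it is as long as the column. A row is only reported when its date happens to line up with the same value in the recycled vector. Around rows 181 to 187 (top line: test$true_date, bottom line: the recycled vector), we have: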

    "2024-06-30" "2024-07-01" "2024-07-02" "2024-07-03" "2024-07-04" "2024-07-05" "2024-07-06"
                                 match        match
    "2024-07-02" "2024-07-03" "2024-07-02" "2024-07-03" "2024-07-02" "2024-07-03" "2024-07-02" 
    

    vs

    "2024-06-30" "2024-07-01" "2024-07-02" "2024-07-03" "2024-07-04" "2024-07-05" "2024-07-06"
                                         
    "2024-07-03" "2024-07-04" "2024-07-03" "2024-07-04" "2024-07-03" "2024-07-04" "2024-07-03"
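
The alignment shown above can be reproduced with a short sketch. It rebuilds the `test` data frame from the question and makes the recycling explicit with `rep_len()`. The variable names (`targets`, `shifted`, `recycled`, `hits_eq`, `hits_shifted`, `hits_in`) are only illustrative, and the character strings are converted with `as.Date()` up front, an assumption made here so that both sides of each comparison have the same type.

```r
# Rebuild the data frame from the question.
test <- data.frame(true_date = as.Date(1:365, origin = "2024-01-01"))

# What `==` effectively compares against: the two values repeated
# alternately until the vector is as long as the column.
targets  <- as.Date(c("2024-07-02", "2024-07-03"))
recycled <- rep_len(targets, nrow(test))

# Rows 183 and 184 happen to line up with the recycled values.
hits_eq <- which(test$true_date == recycled)

# Shifting both targets by one day breaks the alignment: nothing matches.
shifted      <- as.Date(c("2024-07-03", "2024-07-04"))
hits_shifted <- which(test$true_date == rep_len(shifted, nrow(test)))

# %in% checks every row against the whole set, so position is irrelevant.
hits_in <- which(test$true_date %in% shifted)

print(hits_eq)       # 183 184
print(hits_shifted)  # integer(0)
print(hits_in)       # 184 185
```

With `%in%`, the result no longer depends on whether a date sits on an odd or even row, which is why it behaves consistently where `==` only appeared to work.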