
Unexpected results using str_split and union in a function with sapply


Given this data.frame:

library(dplyr)
library(stringr)
ml.mat2 <- structure(list(value = c("a", "b", "c"), ground_truth = c("label1, label3", 
"label2", "label1"), predicted = c("label1", "label2,label3", 
"label1")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))

glimpse(ml.mat2)
Observations: 3
Variables: 3
$ value        <chr> "a", "b", "c"
$ ground_truth <chr> "label1, label3", "label2", "label1"
$ predicted    <chr> "label1", "label2,label3", "label1"

I want to measure the length of the union of ground_truth and predicted for each row, after splitting each comma-separated string into its individual labels.

In other words, I would expect a result of length 3 with values of 2 2 1.

I wrote a function to do this, but it only seems to work outside of sapply:

m_fn <- function(x,y) length(union(unlist(sapply(x, str_split,",")), 
                             unlist(sapply(y, str_split,","))))

m_fn(ml.mat2$ground_truth[1], y = ml.mat2$predicted[1])

[1] 2

m_fn(ml.mat2$ground_truth[2], y = ml.mat2$predicted[2])

[1] 2

m_fn(ml.mat2$ground_truth[3], y = ml.mat2$predicted[3])

[1] 1

Rather than iterating through the rows of the data set manually like this or with a loop, I would expect to be able to vectorize the solution with sapply like this:

sapply(ml.mat2$ground_truth, m_fn, ml.mat2$predicted)

However, the unexpected results are:

label1, label3         label2         label1 
             4              3              3

Solution

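  • sapply only iterates over its first argument; ml.mat2$predicted is passed whole to every call as y, so each row is compared against all predicted labels at once. Since both columns have the same length, you can iterate over an index of row numbers instead: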

    sapply(1:nrow(ml.mat2), function(i) m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i])) 
    
    #[1] 2 2 1
    

    or with seq_len:

    sapply(seq_len(nrow(ml.mat2)), function(i) 
      m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
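
Another option, not part of the original answer, is mapply, which walks over several vectors in parallel and so pairs each ground_truth element with the matching predicted element. The sketch below is self-contained and makes two assumptions: the helper name union_size is hypothetical, and it uses base R's strsplit in place of stringr::str_split so that no package is required.

```r
# Data from the question, rebuilt as a plain data.frame
ml.mat2 <- data.frame(
  value        = c("a", "b", "c"),
  ground_truth = c("label1, label3", "label2", "label1"),
  predicted    = c("label1", "label2,label3", "label1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct labels across two
# comma-separated strings (x and y are single strings)
union_size <- function(x, y) {
  length(union(strsplit(x, ",", fixed = TRUE)[[1]],
               strsplit(y, ",", fixed = TRUE)[[1]]))
}

# mapply calls union_size(ground_truth[i], predicted[i]) for each i
res <- mapply(union_size, ml.mat2$ground_truth, ml.mat2$predicted,
              USE.NAMES = FALSE)
res
#[1] 2 2 1
```

Note that splitting "label1, label3" on a bare comma leaves a leading space in " label3". That does not change the counts for this data, but if such labels should match "label3" elsewhere, wrapping the split result in trimws() would be one way to handle it.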