Search code examples
rdataframetextiterationcross-reference

Cross-referencing columns in two data frames with different lengths


New to R here, driving myself crazy trying to do something that I feel should be simple - thought I'd consult the hive mind before continuing to bang my head against my own limitations. Details:

Two dataframes (A and B) are created from different CSVs of different lengths. A has 221 rows, and B has 1355 rows. For troubleshooting purposes, dummy versions of these dataframes might look like these:

A <- data.frame(Datasets = c("A", "B", "C"), in_B = c(NA, NA, NA))
B <- data.frame(Dataset = c("A", "B", "H", "D", "E", "F", "G"), irrelevent_column = NA)

The Datasets/Dataset fields above contain text - specifically, each record contains the name of a dataset within a database.

My goal is to iterate through A$Datasets to see if all of values in the Datasets column are also in B$Dataset. I'm looking for exact matches between entries, partial matches will give me a lot of false returns. I want to track this in A$in_B, writing "Yes" if there is a match and "No" if not. Based on the above example, the output should look like this:

      Datasets     in_B
1       A          Yes
2       B          Yes
3       C          No

I'm sure it's something silly that I'm missing, but I'm definitely struggling here - any help would be greatly appreciated!!!

My current (failed) attempt looks like this:

    for (row in nrow(A)){
      if (A$Datasets %in% B$Dataset){
        A$in_B == "Yes"
      } else {
        A$in_B == "No"
      }
    }

Instead of getting the intended result above,m I got the following error:

**Error: the condition has length > 1**

I've also gotten the following error while attempting different approaches with the actual data, but not with the test example created above - I'm not sure why, and can't seem to replicate it with test data:

$ operator is invalid for atomic vectors

Solution

  • Problems:

    1. == is a test of equality, it can never be used for assignment. Use <- or =.
    2. if (cond) ..., the cond must always be exactly length 1, never less never more. In an if statement, it is generally a single-value comparison (or such) that returns a single value, if you need to test with something that returns other than 1 then you need to aggregate it somehow, such as all(..) or any(..) or sum(..), etc.
    3. You have a row-wise loop using row as a variable but you never reference row. You should be referencing it inside your tests and assignments.
    4. row in nrow(A) will only ever work on the last row, since nrow(A) returns a single number. If you want to iterate over all rows, then something like 1:nrow(A) or seq_len(nrow(A)) (the latter is better programmatically if A might have 0 rows, generally not a factor interactively).

    If I fixed the for/if code literally (which I don't recommend), I might use

    for (row in seq_len(nrow(A))) {
      if (A$Datasets[row] %in% B$Dataset) {
        A$in_B[row] <- "Yes"
      } else {
        A$in_B[row] <- "No"
      }
    }
    

    But in the end, R will really perform much better here (and it might be easier to read and maintain as well) if you do things on a whole vector/column at a time.

    A$in_B <- A$Datasets %in% B$Dataset
    

    This makes the column a logical one with TRUE and FALSE, I generally recommend those instead of "Yes"/"No" ... but those are also easy to use if you really prefer:

    A$in_B <- ifelse(A$Datasets %in% B$Dataset, "Yes", "No")