r dataframe text iteration cross-reference

Cross-referencing columns in two data frames with different lengths

New to R here, driving myself crazy trying to do something that I feel should be simple - thought I'd consult the hive mind before continuing to bang my head against my own limitations. Details:

Two dataframes (A and B) are created from different CSVs of different lengths. A has 221 rows, and B has 1355 rows. For troubleshooting purposes, dummy versions of these dataframes might look like these:

A <- data.frame(Datasets = c("A", "B", "C"), in_B = c(NA, NA, NA))
B <- data.frame(Dataset = c("A", "B", "H", "D", "E", "F", "G"), irrelevent_column = NA)

The Datasets/Dataset fields above contain text - specifically, each record contains the name of a dataset within a database.

My goal is to iterate through A$Datasets to see if all of values in the Datasets column are also in B$Dataset. I'm looking for exact matches between entries, partial matches will give me a lot of false returns. I want to track this in A$in_B, writing "Yes" if there is a match and "No" if not. Based on the above example, the output should look like this:

      Datasets     in_B
1       A          Yes
2       B          Yes
3       C          No

I'm sure it's something silly that I'm missing, but I'm definitely struggling here - any help would be greatly appreciated!!!

My current (failed) attempt looks like this:

    for (row in nrow(A)){
      if (A$Datasets %in% B$Dataset){
        A$in_B == "Yes"
      } else {
        A$in_B == "No"
      }
    }

Instead of getting the intended result above,m I got the following error:

**Error: the condition has length > 1**

I've also gotten the following error while attempting different approaches with the actual data, but not with the test example created above - I'm not sure why, and can't seem to replicate it with test data:

$ operator is invalid for atomic vectors

Solution

Problems:

== is a test of equality, it can never be used for assignment. Use <- or =.
if (cond) ..., the cond must always be exactly length 1, never less never more. In an if statement, it is generally a single-value comparison (or such) that returns a single value, if you need to test with something that returns other than 1 then you need to aggregate it somehow, such as all(..) or any(..) or sum(..), etc.
You have a row-wise loop using row as a variable but you never reference row. You should be referencing it inside your tests and assignments.
row in nrow(A) will only ever work on the last row, since nrow(A) returns a single number. If you want to iterate over all rows, then something like 1:nrow(A) or seq_len(nrow(A)) (the latter is better programmatically if A might have 0 rows, generally not a factor interactively).

If I fixed the for/if code literally (which I don't recommend), I might use

for (row in seq_len(nrow(A))) {
  if (A$Datasets[row] %in% B$Dataset) {
    A$in_B[row] <- "Yes"
  } else {
    A$in_B[row] <- "No"
  }
}

But in the end, R will really perform much better here (and it might be easier to read and maintain as well) if you do things on a whole vector/column at a time.

A$in_B <- A$Datasets %in% B$Dataset

This makes the column a logical one with TRUE and FALSE, I generally recommend those instead of "Yes"/"No" ... but those are also easy to use if you really prefer:

A$in_B <- ifelse(A$Datasets %in% B$Dataset, "Yes", "No")