New to R here, driving myself crazy trying to do something that I feel should be simple - thought I'd consult the hive mind before continuing to bang my head against my own limitations. Details:
Two dataframes (A and B) are created from different CSVs of different lengths. A has 221 rows, and B has 1355 rows. For troubleshooting purposes, dummy versions of these dataframes might look like these:
A <- data.frame(Datasets = c("A", "B", "C"), in_B = c(NA, NA, NA))
B <- data.frame(Dataset = c("A", "B", "H", "D", "E", "F", "G"), irrelevent_column = NA)
The Datasets/Dataset fields above contain text - specifically, each record contains the name of a dataset within a database.
My goal is to iterate through A$Datasets to see if all of values in the Datasets column are also in B$Dataset. I'm looking for exact matches between entries, partial matches will give me a lot of false returns. I want to track this in A$in_B, writing "Yes" if there is a match and "No" if not. Based on the above example, the output should look like this:
Datasets in_B
1 A Yes
2 B Yes
3 C No
I'm sure it's something silly that I'm missing, but I'm definitely struggling here - any help would be greatly appreciated!!!
My current (failed) attempt looks like this:
for (row in nrow(A)){
if (A$Datasets %in% B$Dataset){
A$in_B == "Yes"
} else {
A$in_B == "No"
}
}
Instead of getting the intended result above,m I got the following error:
**Error: the condition has length > 1**
I've also gotten the following error while attempting different approaches with the actual data, but not with the test example created above - I'm not sure why, and can't seem to replicate it with test data:
$ operator is invalid for atomic vectors
Problems:
==
is a test of equality, it can never be used for assignment. Use <-
or =
.if (cond) ...
, the cond
must always be exactly length 1, never less never more. In an if
statement, it is generally a single-value comparison (or such) that returns a single value, if you need to test with something that returns other than 1 then you need to aggregate it somehow, such as all(..)
or any(..)
or sum(..)
, etc.row
as a variable but you never reference row
. You should be referencing it inside your tests and assignments.row in nrow(A)
will only ever work on the last row, since nrow(A)
returns a single number. If you want to iterate over all rows, then something like 1:nrow(A)
or seq_len(nrow(A))
(the latter is better programmatically if A
might have 0 rows, generally not a factor interactively).If I fixed the for
/if
code literally (which I don't recommend), I might use
for (row in seq_len(nrow(A))) {
if (A$Datasets[row] %in% B$Dataset) {
A$in_B[row] <- "Yes"
} else {
A$in_B[row] <- "No"
}
}
But in the end, R will really perform much better here (and it might be easier to read and maintain as well) if you do things on a whole vector/column at a time.
A$in_B <- A$Datasets %in% B$Dataset
This makes the column a logical
one with TRUE
and FALSE
, I generally recommend those instead of "Yes"
/"No"
... but those are also easy to use if you really prefer:
A$in_B <- ifelse(A$Datasets %in% B$Dataset, "Yes", "No")