d3:
Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22
I am using this loop:
for ( i in 1:(nrow (d3)-1) ){
for (j in (i+1):nrow(d3)) {
if(c(i) == c(j)) {
print(c(j))
# d4 <- subset.data.frame(c(j))
}
}
}
I want to compare all the rows in Col1 and eliminate the ones that are not the same. Then I want to output a data frame with only the rows that share the same value in Col1.
Expected Output:
Col1 Col2
PBR565 22
PBR565 22
PBR565 22
Not sure what's wrong with my nested loop. Is it because I don't specify the column names?
The OP has requested to compare all the rows in Col1 and eliminate the ones that are not the same.
If I understand correctly, the OP wants to remove all rows where the value in Col1 appears only once and to keep only those rows where the value appears two or more times.
This can be accomplished by finding duplicated values in Col1. The duplicated() function marks the second and subsequent appearances of a value as duplicated. Therefore, we need to scan forward and backward and combine both results:
d3[duplicated(d3$Col1) | duplicated(d3$Col1, fromLast = TRUE), ]
    Col1 Col2
2 PBR565   22
3 PBR565   22
4 PBR565   22
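To make the forward and backward scan more concrete, here is a minimal sketch that prints the intermediate logical vectors. It rebuilds the sample data with base R's data.frame() so that it runs on its own; the names fwd, bwd, keep and d4 are only illustrative.

```r
# Sample data from the question, rebuilt with base R (assumption: character Col1)
d3 <- data.frame(
  Col1 = c("PBR569", "PBR565", "PBR565", "PBR565"),
  Col2 = c(23, 22, 22, 22),
  stringsAsFactors = FALSE
)

# Forward scan: the first appearance of each value is NOT marked
fwd <- duplicated(d3$Col1)                    # FALSE FALSE  TRUE  TRUE

# Backward scan: the last appearance of each value is NOT marked
bwd <- duplicated(d3$Col1, fromLast = TRUE)   # FALSE  TRUE  TRUE FALSE

# A row is kept if it is marked in either direction
keep <- fwd | bwd                             # FALSE  TRUE  TRUE  TRUE
d4 <- d3[keep, ]
print(d4)
```

A value that appears only once is never marked in either scan, which is why the PBR569 row drops out.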
The same can be achieved by counting the appearances using the table() function, as suggested by Ryan. Here, the counts are filtered to keep only those entries which appear two or more times.
t <- table(d3$Col1)
d3[d3$Col1 %in% names(t)[t >= 2], ]
Please note that this is different from Ryan's solution, which keeps only the rows whose value appears most often. Only one value is picked, even in case of ties. (For the given small sample dataset, both approaches return the same result.)
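The difference only shows up when there are ties, so here is a small sketch on made-up data. The data frame dx and its values are hypothetical and not part of the question; cnt, all_repeated and most_frequent are illustrative names.

```r
# Hypothetical data with a tie: "A" and "B" both appear twice, "C" once
dx <- data.frame(
  Col1 = c("A", "A", "B", "B", "C"),
  Col2 = 1:5,
  stringsAsFactors = FALSE
)

cnt <- table(dx$Col1)

# Keep every value that appears two or more times: rows 1 to 4 ("A" and "B")
all_repeated <- dx[dx$Col1 %in% names(cnt)[cnt >= 2], ]

# which.max() returns only the first maximum, so only "A" survives: rows 1 and 2
most_frequent <- dx[dx$Col1 == names(which.max(cnt)), ]

print(all_repeated)
print(most_frequent)
```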
Ryan's answer can be rewritten in a slightly more concise way:
d3[d3$Col1 == names(which.max(t)), ]
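As for the nested loop in the question: c(i) == c(j) compares the two loop indices rather than the values in Col1, and because j always starts at i + 1 the indices are never equal, so nothing is ever printed. If a loop is preferred anyway, the following is a rough sketch of how it could look, assuming the goal is the same as above (keep rows whose Col1 value occurs at least twice). The names n, keep and d4 are illustrative, and the sample data is rebuilt with base R so the block runs on its own.

```r
# Sample data from the question, rebuilt with base R
d3 <- data.frame(
  Col1 = c("PBR569", "PBR565", "PBR565", "PBR565"),
  Col2 = c(23, 22, 22, 22),
  stringsAsFactors = FALSE
)

n <- nrow(d3)
keep <- logical(n)   # all FALSE to begin with

# seq_len() avoids the 1:0 pitfall when d3 has fewer than two rows
for (i in seq_len(max(n - 1, 0))) {
  for (j in (i + 1):n) {
    # Compare the VALUES in Col1, not the indices i and j
    if (d3$Col1[i] == d3$Col1[j]) {
      keep[i] <- TRUE
      keep[j] <- TRUE
    }
  }
}

d4 <- d3[keep, ]
print(d4)
```

This does a quadratic number of comparisons, so the vectorised duplicated() or table() variants are likely the better choice for anything but small data.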
Data:
d3 <- data.table::fread(
"Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22", data.table = FALSE)