Hi, I have some data that I am reading in from a CSV file, laid out in binary (0/1) form:
1 2 3 4...N
1 0 1 0 1...1
2 1 1 0 1...1
3 0 0 0 0...0
4 1 0 1 1...1
. 1 1 1 0...1
. 1 0 0 0...1
N 0 0 1 1...0
I want to take a subset of this data containing only the rows whose sum is greater than some number, say 10, or x. The first column is a placeholder column for the customer ID, so it needs to be excluded from the sum. Do you have any suggestions about how I could go about doing this?
I've been trying various things like df = subset(), but I've not been able to get the syntax right.
Thanks in advance.
We can do this with rowSums:
df1[rowSums(df1) > 10, , drop = FALSE]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
#7 0 0 0 1 0 0 1 1 0 1 1 1 1 1 0 0 0 1 1 1
#9 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 1
In the OP's dataset, the first column 'X' is not binary and holds larger numbers. If that variable is included, the rowSums would be greater than 10 regardless of the binary values. It is the index ID and should not be used in the calculation. Removing it inside rowSums gives the correct subset:
df1[rowSums(df1[-1]) > 10, ]
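Since the question mentions subset(), the same filter can probably be written with it as well. This is only a sketch: the data frame dat, its column names, and the threshold x below are made up for illustration and are not the OP's data.

```r
# Hypothetical toy data: first column is a customer ID, the rest are 0/1 flags
dat <- data.frame(
  id = c(101, 102, 103),
  p1 = c(1, 0, 1),
  p2 = c(1, 0, 1),
  p3 = c(1, 0, 0)
)
x <- 2  # threshold, assumed for illustration

# dat and x are not column names, so subset() finds them in the calling
# environment; dat[-1] drops the ID column before the rows are summed
res <- subset(dat, rowSums(dat[-1]) > x)
res
```

The binary row sums here are 3, 0 and 2, so only the first row (id 101) passes the x = 2 threshold. Note that subset() evaluates its condition inside the data frame first, so a column that happened to be named x would shadow the threshold variable.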
# Example data for the first snippet (20 binary columns, no ID column)
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:1, 10 * 20, replace = TRUE), ncol = 20))
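To try the ID-column case end to end, here is a minimal sketch that puts an ID column in front of simulated binary data and filters on a threshold x. The names df2, bin, totals and out, and the ID values, are assumptions made for this example only.

```r
set.seed(24)
# Hypothetical data: 10 customers with 20 binary columns each
bin <- as.data.frame(matrix(sample(0:1, 10 * 20, replace = TRUE), ncol = 20))
# Put a non-binary ID column in front, as in the question
df2 <- cbind(ID = 1001:1010, bin)

x <- 10  # threshold

# Sum only the binary columns (everything except the first, ID, column)
totals <- rowSums(df2[-1])

# Keep the rows whose binary sum exceeds x; the ID column stays in the result
out <- df2[totals > x, , drop = FALSE]
out
```

Computing totals once and reusing it keeps the threshold easy to change, and the ID column is excluded from the sum but still present in the output.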