I'm trying to identify 'unique' and 'near unique' cases or records from a dataset for a disclosure control project. Particularly combinations on certain variables that only appear once, twice etc.
The records appear in:
table(Age,Sex,Ethnicity)
I am most interested in the elements (which are TRUE) of:
table(Age,Sex,Ethnicity)==1
table(Age,Sex,Ethnicity)==2
I know there are roughly 150 cases I am looking for from:
sum(table(Age,Sex,Ethnicity)==1)
there is also an identifier in the dataset which be a nice output or a number from 1:length(Age)*length(Sex)*length(Ethnicity) would be just as good. I was hoping it return a list something like:
[1] 103 207 218....
[41] * * *
[81] * * *
where 'identifier' = 103, 207 and 218 for the first 3 of 150 cases where:
table(Age,Sex,Ethnicity)==1
I was naively hoping that something like:
data$identifier[table(age,sex,mar,emp,edu) == 1]
names(table(Age,Sex,Ethnicity)
would work but no such luck. I've also looked into unique() but that returns every combination (that occurs once or more). Any help or input would be much appreciated.
Added reproducible example (hopefully) Example
set.seed(1234)
a <- 1+rpois(100,1)
b <- 1+rpois(100,1)
c <- 1+rpois(100,1)
a[a >= 5] <- 4
b[b >= 5] <- 4
c[c >= 5] <- 4
eg <- cbind(1:100,a,b,c)
(sum(table(a,b,c)==1))
should have 12 'unique' combinations, which I would like to identify using the first column of eg (or the identifier from the dataset)
I think the easiest way to do this is using the data.table
package:
library(data.table)
eg.dt <- as.data.table(eg)
eg.dt[, list(N=.N), by=.(a,b,c)][N==1]
How this works: eg.dt[, list(N=.N), by=.(a,b,c)]
counts up the number of occurences of each (a,b,c) combination. The [N==1]
filters out those that occur precisely once.
Or if you want to stick with dataframes (not data.table) try plyr
:
library(plyr)
eg <- data.frame(eg)
subset(ddply(eg, .(a, b, c), nrow), V1 == 1)
This works in the same way: the ddply(eg, .(a, b, c), nrow)
makes a dataframe with a column "V1" being the number of times that combination occurs; then you just subset it for combinations that occur once only.
I think there might be a way to do it with your table(a,b,c)
but I can't think of one that isn't convoluted.