I have a long list of people receiving drugs coded in the variable ATC. I want to find out how many people have used 4 specific drugs. For example, I want to count how many people have used this particular pattern of drugs "C07ABC" & "C09XYZ" &"C08123" &"C03ZOO". Some people may have used some agents (eg C07 or C08) more than once, thats ok, I just want to count how many unique people had the regimen I'm interested in. I don't care how many times they had the unique drugs. However, because I have various patterns that I want to look up - I would like to use the grepl function. To explain this further, my first attempt at this problem tried a sum command:
sum(df[grepl('^C07.*?'|'^C09.*?'|'^C08.*?|C03.*?', as.character(df$atc)),])
However this doesn't work because I think the sum command needs a boolean function. ALso, I think the | sign isn't correct here either (I want an &) but I'm just showing the code so that you know what I'm after. Maybe an ave function is what I need - but am unsure of how I would code this?
Thanks in advance.
df
names fruit dates atc
4 john kiwi 2010-07-01 C07ABC
7 john apple 2010-09-01 C09XYZ
9 john banana 2010-11-01 C08123
13 john orange 2010-12-01 C03ZOO
14 john apple 2011-01-01 C07ABC
2 mary orange 2010-05-01 C09123
5 mary apple 2010-07-01 C03QRT
8 mary orange 2010-07-01 C09ZOO
10 mary apple 2010-09-01 C03123
12 mary apple 2010-11-01 C09123
1 tom apple 2010-02-01 C03897
3 tom banana 2010-03-01 C02CAMN
6 tom apple 2010-06-01 C07123
11 tom kiwi 2010-08-01 C02DA12
In addition to not needing to deliver entire dataframe lines to sum
you also had extra quote marks in that pattern:
> sum( grepl('^C07.*|^C09.*|^C08.*|C03.*', df$atc) )
[1] 12
I think this is easier to read:
> sum( grepl('^(C07|C09|C08|C03).*', df$atc) )
[1] 12
But now I read that you want all of thos used and to do the calculation within a patient id. That might have requiree using & as the connector but I decide to try a different route and use unique
and then count then number of unique matches while doing it within an aggregate
operation.
> aggregate(atc ~ names, data=df,
function(drgs) length(unique(grep('^(C07|C09|C08|C03)', drgs))))
names atc
1 john 5
2 mary 5
3 tom 2
Although that's the number of matching items but not the number of unique items, because I forgot to put value=TRUE in the grep
call (and also need to use substr
to avoid separately counting congeners with different trailing ATC codes):
> aggregate(atc ~ names, data=df, function(drgs) length(unique(grep('^C0[7983]', substr(drgs,1,3), value=TRUE))))
names atc
1 john 4
2 mary 2
3 tom 2
This would be somewhat similar to @MichaelLawrence's matrix/table approach, but I think it would scale better since the "tables" being created would be much smaller:
combo <- c("C07", "C09", "C08", "C03")
tapply(df$atc, df$names, function(drgs) sum(combo %in% substr(drgs,1,3)) )
#------
john mary tom
4 2 2