Search code examples
runiquelong-integer

how many people received 4 drugs of interest? R


I have a long list of people receiving drugs coded in the variable ATC. I want to find out how many people have used 4 specific drugs. For example, I want to count how many people have used this particular pattern of drugs "C07ABC" & "C09XYZ" &"C08123" &"C03ZOO". Some people may have used some agents (eg C07 or C08) more than once, thats ok, I just want to count how many unique people had the regimen I'm interested in. I don't care how many times they had the unique drugs. However, because I have various patterns that I want to look up - I would like to use the grepl function. To explain this further, my first attempt at this problem tried a sum command:

sum(df[grepl('^C07.*?'|'^C09.*?'|'^C08.*?|C03.*?', as.character(df$atc)),])

However this doesn't work because I think the sum command needs a boolean function. ALso, I think the | sign isn't correct here either (I want an &) but I'm just showing the code so that you know what I'm after. Maybe an ave function is what I need - but am unsure of how I would code this?

Thanks in advance.

df
   names  fruit      dates     atc
4   john   kiwi 2010-07-01  C07ABC
7   john  apple 2010-09-01  C09XYZ
9   john banana 2010-11-01  C08123
13  john orange 2010-12-01  C03ZOO
14  john  apple 2011-01-01  C07ABC
2   mary orange 2010-05-01  C09123
5   mary  apple 2010-07-01  C03QRT
8   mary orange 2010-07-01  C09ZOO
10  mary  apple 2010-09-01  C03123
12  mary  apple 2010-11-01  C09123
1    tom  apple 2010-02-01  C03897
3    tom banana 2010-03-01 C02CAMN
6    tom  apple 2010-06-01  C07123
11   tom   kiwi 2010-08-01 C02DA12

Solution

  • In addition to not needing to deliver entire dataframe lines to sum you also had extra quote marks in that pattern:

    > sum( grepl('^C07.*|^C09.*|^C08.*|C03.*', df$atc) )
    [1] 12
    

    I think this is easier to read:

    >  sum( grepl('^(C07|C09|C08|C03).*', df$atc) )
    [1] 12
    

    But now I read that you want all of thos used and to do the calculation within a patient id. That might have requiree using & as the connector but I decide to try a different route and use unique and then count then number of unique matches while doing it within an aggregate operation.

    > aggregate(atc ~ names, data=df, 
                           function(drgs) length(unique(grep('^(C07|C09|C08|C03)', drgs))))
      names atc
    1  john   5
    2  mary   5
    3   tom   2
    

    Although that's the number of matching items but not the number of unique items, because I forgot to put value=TRUE in the grep call (and also need to use substr to avoid separately counting congeners with different trailing ATC codes):

    > aggregate(atc ~ names, data=df, function(drgs) length(unique(grep('^C0[7983]', substr(drgs,1,3), value=TRUE))))
      names atc
    1  john   4
    2  mary   2
    3   tom   2
    

    This would be somewhat similar to @MichaelLawrence's matrix/table approach, but I think it would scale better since the "tables" being created would be much smaller:

     combo <- c("C07", "C09", "C08", "C03")
     tapply(df$atc, df$names, function(drgs) sum(combo %in% substr(drgs,1,3)) )
    #------
    john mary  tom 
       4    2    2