Search code examples
uniquestatacategories

Identify unique levels of categorical variable


I have a list of person ids, and the types of medicines they got on specific dates.

I would like to create a variable count whereby I can give the indicator 1 to the first drug that occurs, 2 to the second unique drug and 3 to the third unique drug. When the first drug occurs after the second and third, I want it to still have the indicator 1. Likewise for unique drug 2, it should maintain the value 2 throughout the person's whole medication history, and the same for drug 3.

     +-------------------------------------+
     | p_id      date      agent_~e  count |
     |-------------------------------------|
 38. |  1001   13dec2001   thiazide       1|
 39. |  1001   12apr2002   thiazide       1|
 40. |  1001   15jul2002   thiazide       1|
 41. |  1001   28aug2002        arb       2|
 42. |  1001   26sep2002        CCB       3|
     |-------------------------------------|
 43. |  1001   26sep2002        arb       2|
 44. |  1001   10oct2002        CCB       3|
 45. |  1001   10oct2002   thiazide       1|
 46. |  1001   10oct2002        arb       2|
 47. |  1001   10dec2002        CCB       3|
     |-------------------------------------|
 48. |  1001   10dec2002        arb       2|
     +-------------------------------------+

Because each person has a different set of drugs, I think I need quite a general solution as opposed to something like

gen count = 1 if agent_type == "thiazide". 

For example, person two is below and they have a very different drug history to person one above.

         +-------------------------------+
         | p_id    date        agent_t~e |
         |-------------------------------|
    207. |  2001   08jul1999   ace_inhib |
    208. |  2001   02aug1999   ace_inhib |
    209. |  2001   25aug1999   ace_inhib |
    210. |  2001   22oct1999   ace_inhib |
    211. |  2001   18nov1999         CCB |
         |-------------------------------|
    212. |  2001   18nov1999   ace_inhib |
    213. |  2001   14dec1999         CCB |
    214. |  2001   12jan2000         CCB |
    215. |  2001   03feb2000         CCB |
    216. |  2001   03feb2000         arb |
         |-------------------------------|
    217. |  2001   02mar2000         CCB |
         +-------------------------------+

Solution

  • "Unique" is a common misnomer here; strictly, it means occurring once only, which is not what you mean at all. "Distinct" is a much better word: for a discussion in Stata context, see here.

    Please find out about dataex from SSC to be able to show data examples that can be copied and pasted directly. Yours required some engineering to be made easy to use.

    Your problem is already a Stata FAQ found here. It is a good idea to look through the FAQs before posting.

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float p_id str8 agent_type float(wanted date)
    1001 "thiazide" 1 15322
    1001 "thiazide" 1 15442
    1001 "thiazide" 1 15536
    1001 "arb"      2 15580
    1001 "CCB"      3 15609
    1001 "arb"      2 15609
    1001 "CCB"      3 15623
    1001 "thiazide" 1 15623
    1001 "arb"      2 15623
    1001 "CCB"      3 15684
    1001 "arb"      2 15684
    2001 "ace_inhi" 1 14433
    2001 "ace_inhi" 1 14458
    2001 "ace_inhi" 1 14481
    2001 "ace_inhi" 1 14539
    2001 "CCB"      2 14566
    2001 "ace_inhi" 1 14566
    2001 "CCB"      2 14592
    2001 "CCB"      2 14621
    2001 "CCB"      2 14643
    2001 "arb"      3 14643
    2001 "CCB"      2 14671
    end
    format date %td 
    
    bysort p_id agent_type (date) : gen firstdate = date[1] 
    egen group = group(p_id firstdate agent_type) 
    bysort p_id (group date agent_type): gen count = sum(group != group[_n-1])  
    assert count == wanted 
    

    Note that the code takes care of the possibility that two or more drugs are first used on the same day by the same person.