Search code examples
mergestatamergesort

Merging two datasets with variables in one corresponding observations in the other


I ran into a challenge today and I hope to get some help. I want to merge 2 data sets. Dataset1 contains the member roster with 3 variables, the batch_id, member_num in the batch and occupation. Dataset2 consists of member license statuses.

The challenge here is that in dataset2 the member_num is represented as a variable in a way that member_num_x in dataset2 corresponds to "x" observation under variable member_num in dataset1. I need to merge these two datasets so that in the end I have one dataset that has batch_id, member_num, occupation, and license status for each member.


Dataset1
| batch_id   | member_num | occupation |
| --------   | --------   |  -------- 
| A01        |   1        | Driver     |
| A01        |   2        | Driver     |
| A01        |   3        | Driver     |
| A01        |   4        | Driver     |
| A02        |   1        | Navigator  |
| A02        |   2        | Navigator  |


Dataset2
| batch_id |member_num_1|member_num_2|member_num_3|member_num_4|
| -------- | --------   | --------   | --------   | --------   |         
| A01      | Yes        |   NA       |   Yes      |   No       | 
| A02      | No    |    |   NA       |


Desired Output 

| batch_id   | member_num | occupation | License_status
| --------   | --------   |  -------- 
| A01        |   1        | Driver     | Yes
| A01        |   2        | Driver     | NA
| A01        |   3        | Driver     | Yes
| A01        |   4        | Driver     | No
| A02        |   1        | Navigator  | No
| A02        |   2        | Navigator  | NA

I have tried using the merge command in Stata but there is no option to do this particular kind of merging. The options that are there use unique variables (almost same as joins on primary keys).


Solution

  • You need to reshape d2 into long format and then merge/link on batch_id, member_num

    Approach 1 (using frames)

    clear
    use d1
    frame create d2
    frame d2: use d2
    frame d2: reshape long member_num_, i(batch_id) j(member_num)
    frlink 1:1 batch_id member_num, frame(d2)
    frget License_status = member_num_, from(d2)
    

    Approach 2 (using merge)

    clear
    use d2
    reshape long member_num_, i(batch_id) j(member_num)
    rename member_num_ License_status
    tempfile d2long
    save `d2long',replace
    use d1,clear
    merge 1:1 batch_id member_num using `d2long',nogenerate keep(1 3)
    

    Output:

           batch_id   member~m   occupat~n   d2   Licens~s  
      1.        A01          1      Driver    1        Yes  
      2.        A01          2      Driver    2         NA  
      3.        A01          3      Driver    3        Yes  
      4.        A01          4      Driver    4         No  
      5.        A02          1   Navigator    5         No  
      6.        A02          2   Navigator    6         NA
    

    Input:

    d1.dta:

           batch_id   member~m   occupat~n  
      1.        A01          1      Driver  
      2.        A01          2      Driver  
      3.        A01          3      Driver  
      4.        A01          4      Driver  
      5.        A02          1   Navigator  
      6.        A02          2   Navigator  
    

    d2.dta:

           batch_id   member~1   member~2   member~3   member~4  
      1.        A01        Yes         NA        Yes         No  
      2.        A02         No         NA