Search code examples
pythonmergesasconcatenation

Is there any equivalent of SAS merging in python?


I am converting SAS codes to python but I could not find any equivalent of this joining method in Python ?

data have_a;
input ID amount_a;
datalines;
1 10
1 15
1 20
1 30
2 10
; 

data have_b;
   input ID amount_b;
   datalines;
1 12
1 14
2 12
2 14
;

data want;
  merge have_a(in=inA) have_b(in=inb);
  by id;
  if inA and inB;
run;

output;

ID  amount_a  amount_b
1   10  12
1   15  14
1   20  14
1   30  14
2   10  12
2   10  14

I have tried pd.concat and merge methods but I could not obtain a result above. Is there any direct method to do it in Python ? I tried;

df1 = pd.DataFrame({'ID': [1, 1, 1, 1, 2], 'A': [10, 15, 20, 30, 10]})

df2 = pd.DataFrame({'ID': [1, 1, 2, 2], 'B': [12, 14, 11, 13]})
pd.concat([df1, df2], axis=1)

output;

 ID   A   ID     B
0   1  10  1.0  12.0
1   1  15  1.0  14.0
2   1  20  2.0  11.0
3   1  30  2.0  13.0
4   2  10  NaN   NaN

Solution

  • If you don't need to carry forward the last observation's value from the shorter dataset then just add a counter per group to each dataset (data frame) and include that variable in the join criteria.

    enter image description here

    But it will make the subsetting criteria harder (at least in SAS code) since you want to keep all records for groups that have any contributions from both inputs. The IN= flags will not work directly since for the tail end of the group only one dataset will be contributing. You would have to retain the IN= flag variables' values from the first observation in the group. Or you could filter based on whether the ID value is in the intersection of the distinct ID values from both datasets.

    I will leave generating the actual python code to implement that logic to others.