Search code examples
many-to-manyspssouter-join

How to perform a many-to-many or (at least) a outer-join in SPSS


usually I use [R] for my data analysis, but these days I have to use SPSS. I was expecting that data manipulation might get a little bit more difficult this way, but after my first day I kind of surrender :D and I really would appreciate some help ... My problem is the following:

  • I have two data sets, which have an ID number. Neither data sets have a unique ID (in one data set, which should have unique IDs, there is kind of a duplicated row)
  • In a perfect world I would like to keep this duplicated row and simply perform a many-to-many-join. But I accepted, that I might have to delete this "bad" row (in dataset A) and perform a 1:many-join (join dataset B to dataset A, which contains the unique IDs).
  • If I run the join (and accept that it seems not to be possible to run a 1:many, but only a many:1-join), I have the problem, that I lose IDs. If I join dataset A to dataset B I lose all cases, that are not part of dataset B. But I really would like to have both IDs like in a full join or something.

Do you know if there is (kind of) a simple solution to my problem?

Example:

dataset A:

ID VAL1
1 A
1 B
2 D
3 K
4 A

dataset B:

ID VAL2
1 g
2 k
4 a
5 c
5 d
5 a
2 x

expected result (best solution):

ID VAL1 VAL2
1 A g
1 B g
2 D k
3 K NA
4 A a
2 D x

expected result (second best solution):

ID VAL1 VAL2
1 A g
2 D k
3 K NA
4 A a
5 NA c
5 NA d
5 NA a
2 D x

what I get (worst solution):

ID VAL1 VAL2
1 A g
2 D k
4 A a
5 NA c
5 NA d
5 NA a
2 D x

Solution

  • From your example It looks like what you need is a full many to many join, based on the ID's existing in dataset A. You can get this by creating a full Cartesian-Product of the two dataset, using dataset A as the first\left dataset.
    The following syntax assumes you have the STATS CARTPROD extention command installed. If you don't you can see here about installing it.

    First I'll recreate your example to demonstrate on:

    dataset close all.
    data list list/id1  vl1  (2F3) .
    begin data 
    1   232
    1   433
    2   456
    3   246
    4   468
    end data.
    dataset name aaa.
    
    data list list/id2  vl2  (2F3) .
    begin data 
    1   111
    2   222
    4   333
    5   444
    5   555
    5   666
    2   777
    3   888
    end data.
    dataset name bbb.
    

    Now the actual work is fairly simple:

    DATASET ACTIVATE aaa.
    STATS CARTPROD VAR1=id1 vl1 INPUT2=bbb VAR2=id2 vl2
    /SAVE OUTFILE="C:\somepath\yourcartesianproduct.sav".
    
    * The new dataset now contains all possible combinations of rows in the two datasets.
    * we will select only the relevant combinations, where the two ID's match.
    select if id1=id2.
    exe.