Search code examples
mergesasmissing-datawide-format-data

How can I collapse repeated missing observations into a single nonmissing observation for the same ID in SAS?


I have a large dataset with hundreds of variables collected over several timepoints. The variables are already defined by timepoint, but each observation is a different timepoint. It's like the dataset was planned in wide format but collected in long format, like this:

data have;
input id timepoint $ var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 time_1 1 . . note1 . .
1 time_2 . 2 . . note2 .
1 time_3 . . 3 . . note3
2 time_1 1 . . note1 . .
2 time_2 . 2 . . note2 .
2 time_3 . . 3 . . note3
;
run;

The variable timepoint is redundant; the variables already describe the timepoint. I need to collapse the dataset into a single observation per id (into just the standard wide format), like this:

data want;
input id var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 1 2 3 note1 note2 note3
2 1 2 3 note1 note2 note3
;
run;

Note: Unfortunately, variable names do not always end in _t1, _t2, etc. as in my example (e.g., note_t1_2) so I cannot easily reference suffixes.

My first thought was to break up the dataset into separate visit datasets (data timepoint_1_have; set have; if timepoint = "time_1";, etc.), and then merge them by id. A straight merge resulted in loss of data (I'm sure that was obvious, but I thought maybe the missing values would be overwritten). So I thought, I'll drop all the variables with only missing values before merging by id. This proved incredibly difficult and I cannot find a way to do this for both character and numeric variables without pages of macros...

So I tried different strategies using retain. Since first.id for variables at timepoints >1 have missing values, I thought retaining instances of nonmissing values down timepoints and then keeping last.id might work:

data timepoint_1_want;
set timepoint_1_have;
array Nums[*] _numeric_;
array Chars[*] _character_;
by id;
do i = 1 to dim(Nums);
    if not missing(Nums[i]) then do;
        retain Nums[i];
    end;
do i = 1 to dim(Chars);
    if not missing(Chars[i]) then do;
        retain Chars[i];
    end;
drop i;
if last.id then output;
run;

But retain can't be used within a do loop, so this didn't work either:

12155      data timepoint_1_want;
12156      set timepoint_1_have;
12157      array Nums[*] _numeric_;
12158      array Chars[*] _character_;
12159      by id;
12160      do i = 1 to dim(Nums);
12161          IF not missing(Nums[i]) THEN do;
12162              retain Nums[i];
                              -
                              22
                              76
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
              a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
              _CHARACTER_, _CHAR_, _NUMERIC_.

ERROR 76-322: Syntax error, statement will be ignored.

12163          end;
12164      do i = 1 to dim(Chars);
12165          IF not missing(Chars[i]) THEN do;
12166              retain Chars[i];
                               -
                               22
                               76
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
              a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
              _CHARACTER_, _CHAR_, _NUMERIC_.

ERROR 76-322: Syntax error, statement will be ignored.

12167          end;
12168      drop i;
12169      IF last.id THEN output;
12170      run;

Solution

  • You can use the update trick.

    data have;
    input id timepoint $ var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
    datalines;
    1 time_1 1 . . note1 . .
    1 time_2 . 2 . . note2 .
    1 time_3 . . 3 . . note3
    2 time_1 1 . . note1 . .
    2 time_2 . 2 . . note2 .
    2 time_3 . . 3 . . note3
    ;
    run;
    
    data want;
       update have(keep=id obs=0) have(drop=timepoint);
       by id;
       run;
    

    enter image description here