Search code examples
arrayssasindicesdo-loops

array processing with different indices and missing values in SAS


have is a sas data set with 4 variables: an id and variables storing info on all the activities a respondent shares with 3 different members of a team they're on. There are 4 different activity types, identified by the numbers populating in the :_activities vars for each player (p1 to p3). Below are the first 5 obs:

id  p1_activities   p2_activities   p3_activities
A   1,2,3,4         1,3 
B   1,3             1,2,3           1,2,3
C                   1,2,3           1,2,3
D                   1,2,3   
E   1,2,3                           1

Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:

p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1            1            1            1            1            0            …
1            0            1            0            1            1            …
.            .            .            .            1            1            …
.            .            .            .            1            1            …
1            1            1            0            .            .            …

I do this now by initializing all of the variable names in a length statement, then writing a ton if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:

data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
       p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
       p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
       p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;

array plracts  {*} p1_activities p2_activities p3_activities;

array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;

do i=1 to dim(plracts);
do j=1 to dim(p1actflg);

         if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
    else if missing(plracts{i})                  then p1actflg{j}=.;
    else                                              p1actflg{j}=0;

end;
end;

*do this again for the other p#actflg arrays;

run;

My "array subscript is out of range" because of the different lengths of the player and activity arrays, but nesting in different do-loops would result in me writing many more lines of code than a wallpaper solution.

How would you do this more systematically, and/or in far fewer lines of code?


Solution

  • Not sure why you are processing 4 activities for flags when there are only 3.

    Some ideas:

    • Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
      • activities_p1-activities_p3
    • Refactoring the flag column names to number suffixes
      • flag_p1_1-flag_p1_4
      • flag_p2_1-flag_p2_4
      • flag_p3_1-flag_p3_4
    • Use DIM to stay within array bounds.
    • Use two dimensional array for flags
    • Use direct addressing of items to be flagged
    • Add error checking

    Not fewer, but perhaps more robust ?

    This code examines each item in the activities list as opposed to seeking presence of a specific items (1..4):

    data want;
      set have;
      array activities
        activities_p1-activities_p3
      ;
      array flags(3,4) 
        flag_p1_1-flag_p1_4
        flag_p2_1-flag_p2_4
        flag_p3_1-flag_p3_4
      ;
    
      do i = 1 to dim(activites);
        if missing(activities[i]) then continue; %* skip;
        do j = 1 by 1;
          item = scan ( activities[i], j, ',' );
          if missing(item) then leave; %* no more items in csv list;
          item_num = input (item,?1.);
          if missing(item_num) then continue; %* skip, csv item is not a number;
          if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
            put 'WARNING:' item_num 'is invalid for flagging';
            continue; %* skip, csv item is missing, 0, negative or exceeds 4;
          end;
          flags (i, item_num) = 1;
        end;
        * backfill zeroes where flag not assigned;
        do j = 1 to hbound(flags,2);
          flags (i, item_num) = sum (0, flags (i, item_num));  %* sum() handles missing values;
        end;
      end;
    

    Here is the same processing, but only searching for specific items to be flagged:

    data have; length id activities_p1-activities_p3 $20;input 
    id  activities_p1-activities_p3 ; datalines;
    A   1,2,3,4         1,3             .
    B   1,3             1,2,3           1,2,3
    C   .               1,2,3           1,2,3
    D   .               1,2,3           .
    E   1,2,3           .               1
    ;
    data want;
      set have;
      array activities
        activities_p1-activities_p3
      ;
      array flags(3,4) 
        flag_p1_1-flag_p1_4
        flag_p2_1-flag_p2_4
        flag_p3_1-flag_p3_4
      ;
      do i = 1 to dim(activities);
        if not missing(activities[i]) then
        do j = 1 to hbound(flags,2);
          flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
        end;
      end;
    run;
    

    What's going on ?

    • flags variables are reset to missing at top of step
    • hbound return 4 as upper limit of second dimension
    • findw(trim(activities[i]),cats(j),',') find position of j in csv string
      • trim needed to remove trailing spaces which are not part of findw word delimiter list
      • cats converts j number to character representation
      • findw returns position of j in csv string.
        • might want to also compress out spaces and other junk if activity data values are not reliable.
      • first > 0 evaluates position to 0 j not present and 1 present
      • second > 0 is a another logic evaluation that ensures j present flag remains 0 or 1. Otherwise flags would be a frequency count (imagine activity data 1,1,2,3)
    • flags(i,j) covers the 3 x 4 slots available for flagging.