Search code examples
sasmedian

Calculating median across multiple rows and columns in SAS 9.4


I tried searching multiple places but have not been able to find a solution yet. I was wondering if someone here would be able to please help me?

I am trying to calculate a median value (with Q1 and Q3) across multiple rows and columns in SAS 9.4 The dataset I am working with looks like the following:

Obs tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4
1   4            1.5          1            1
2   2.5          2            .            .
3   3            .            .            .
4   4            .            .            .
5   3.5          1            .            .

The context is this is for a medical condition where a person may have 1 (or more) tumors. Each row represents 1 person. Each person may have up to 4 tumors. I would like to determine the median size of all tumors for the entire cohort (not just the median size for each person). Is there a way to calculate this? Thank you in advance.


Solution

  • A transpose of the data will yield a data structure (form) that is amenable to median and quartile computations, at a variety of aggregate combinations, made with PROC SUMMARY and a CLASS statement.

    Example:

    data have;
    input
    patient tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4; datalines;
    1   4            1.5          1            1
    2   2.5          2            .            .
    3   3            .            .            .
    4   4            .            .            .
    5   3.5          1            .            .
    ;
    
    proc transpose data=have out=new_have;
    by patient;
    var tumor:;
    run;
    
    proc summary data=new_have;
      class patient;
      var col1;
      output out=want Q1=Q1 Q3=Q3 MEDIAN=MEDIAN N=N;
    run;
    

    Results

    patient    _TYPE_    _FREQ_    Q1     Q3     MEDIAN     N
    
       .          0        20       1    3.50     2.25     10
       1          1         4       1    2.75     1.25      4
       2          1         4       2    2.50     2.25      2
       3          1         4       3    3.00     3.00      1
       4          1         4       4    4.00     4.00      1
       5          1         4       1    3.50     2.25      2
    

    The _TYPE_ column describes the ways in which the CLASS variables are combined in order to achieve the results for the requested statistics. The _TYPE_ = 0 case is for all values, and, in this problem, the _FREQ_ = 20 indicates 20 inputs went into the computation consideration, and that N = 10 of those were non-missing and were involved in the actual computation. The role of _TYPE_ becomes more obvious when there is more than one CLASS variable.

    From the Output Data Set documentation:

    • the variable _TYPE_ that contains information about the class variables. By default _TYPE_ is a numeric variable. If you specify CHARTYPE in the PROC statement, then _TYPE_ is a character variable. When you use more than 32 class variables, _TYPE_ is automatically a character variable.

    and

    The value of _TYPE_ indicates which combination of the class variables PROC MEANS uses to compute the statistics. The character value of _TYPE_ is a series of zeros and ones, where each value of one indicates an active class variable in the type. For example, with three class variables, PROC MEANS represents type 1 as 001, type 5 as 101, and so on.

    A far less elegant way to compute the median of all is to store all the values in an oversized array and use the MEDIAN function on the array after the last row is read in:

    data median_all;
      set have end=lastrow;
      array values [1000000] _temporary_;
      array sizes tumor_size_1-tumor_size_4;
    
      do sIndex = 1 to dim(sizes);
    /*    if not missing (sizes[sIndex]) then do; */  %* decomment for dense fill;
          vIndex + 1;
          values[vIndex] = sizes[sIndex];
    /*    end; */                                     %* decomment for dense fill;
      end;
    
      if lastrow then do;
        median_all_tumor_sizes = median (of values(*));
        output;
        put (median:) (=);
      end;
    
      keep median:;
    run;
    
    -------- LOG -------
    median_all_tumor_sizes=2.25