Search code examples
sasleft-joinproc-sqlsas-studio

Left join a bucket value based on a greater than clause


I am looking to create an optimal bucketing macro. My first obstacle is to create equidistant buckets. I am using the sashelp.baseball dataset as an example.

I take the range of logsalary and divide this by 100 to create the distance between each bucket. Then I would like to assign the logsalary column a bucket value if the logsalary is smaller than the bucket value

The code I have tried is attached. I am hoping to be able to join or merge on the bucket limit values and use a greater than or smaller than clause to append a bucket value

/*Sort the baseball dataset by smallest to largest, removing any missing data*/
PROC SORT
    DATA = sashelp.baseball
        (KEEP = logsalary
         WHERE = (NOT MISSING(logsalary)))
    OUT  = baseball;
    BY logsalary;
RUN;

/*Identify the size of each bucket by splitting the range into 100 equidistant buckets*/
DATA _NULL_;
    RETAIN bin_size;
    SET baseball        END = EOF;

    IF _N_ = 1 THEN DO;
        bin_size = logsalary;
        CALL SYMPUT("min_bin",logsalary);
    END;
    IF EOF      THEN DO;
        bin_size = ((logsalary - bin_size) / 100);
        CALL SYMPUT("bin_size",bin_size);
    END;
RUN;

/*Create a vector to identify each bucket range*/
DATA bin_levels;
    DO bin = 1 TO 100;
        IF bin = 1 THEN DO;
            bin_level = &min_bin.;
            OUTPUT;
        END;
        ELSE DO;
            bin_level = &min_bin. + &bin_size. * bin;
            OUTPUT;
        END;
    END;
RUN;

/*Append a bucket number based on the logsalary being smaller than the next bucket value*/
PROC SQL;
    CREATE TABLE binned_data    AS
    SELECT
          a.*
        , b.bin
        , b.bin_level
    FROM
          baseball              a
    LEFT JOIN
          bin_levels            b   ON b.bin_level > a.logsalary
    ;
QUIT;

I would like to see the first ten rows look like this

logSalary     bin
4.2121275979  1
4.2195077052  1
4.248495242   1
4.248495242   1
4.248495242   1
4.248495242   1
4.248495242   1
4.3174881135  2
4.3174881135  2
4.3174881135  2
...

Thanks in advance

EDIT: for now, I am going to go with this solution

DATA bucketed_data;
    RETAIN bin bin_limit;
    SET baseball;

    IF _n_ = 1 THEN DO;
        bin_limit = logsalary;
        bin = 1;
    END;
    IF logsalary > bin_limit THEN DO;
        bin_limit + &bin_size.;
        bin + 1;
    END; 
RUN;

Solution

  • No need for macro variables put the values into a dataset and combine the dataset with the one you want to bin. Let's use 10 bins instead of 100 to make it easier to examine the results.

    First find the minimum and range:

    proc means n min max data=sashelp.baseball;
     var logsalary;
     output out=stats(keep=min range) min=min range=range;
    run;
    

    Then use those to bin the data:

    DATA bucketed_data;
      SET sashelp.baseball (keep=logsalary);
      if _n_=1 then set stats;
      if not missing(logsalary) then do bin=1 to 10 while(logsalary > min+bin*(range/10)); 
         * nothing to do here ;
      end;
    run;
    

    Let's use PROC MEANS to see how it worked.

    proc means n min max ;
     class bin / missing;
     var logsalary;
    run;
    

    Results:

    enter image description here