I am looking to create an optimal bucketing macro. My first obstacle is to create equidistant buckets. I am using the sashelp.baseball dataset as an example.
I take the range of logsalary and divide this by 100 to create the distance between each bucket. Then I would like to assign the logsalary column a bucket value if the logsalary is smaller than the bucket value
The code I have tried is attached. I am hoping to be able to join or merge on the bucket limit values and use a greater than or smaller than clause to append a bucket value
/*Sort the baseball dataset by smallest to largest, removing any missing data*/
PROC SORT
DATA = sashelp.baseball
(KEEP = logsalary
WHERE = (NOT MISSING(logsalary)))
OUT = baseball;
BY logsalary;
RUN;
/*Identify the size of each bucket by splitting the range into 100 equidistant buckets*/
DATA _NULL_;
RETAIN bin_size;
SET baseball END = EOF;
IF _N_ = 1 THEN DO;
bin_size = logsalary;
CALL SYMPUT("min_bin",logsalary);
END;
IF EOF THEN DO;
bin_size = ((logsalary - bin_size) / 100);
CALL SYMPUT("bin_size",bin_size);
END;
RUN;
/*Create a vector to identify each bucket range*/
DATA bin_levels;
DO bin = 1 TO 100;
IF bin = 1 THEN DO;
bin_level = &min_bin.;
OUTPUT;
END;
ELSE DO;
bin_level = &min_bin. + &bin_size. * bin;
OUTPUT;
END;
END;
RUN;
/*Append a bucket number based on the logsalary being smaller than the next bucket value*/
PROC SQL;
CREATE TABLE binned_data AS
SELECT
a.*
, b.bin
, b.bin_level
FROM
baseball a
LEFT JOIN
bin_levels b ON b.bin_level > a.logsalary
;
QUIT;
I would like to see the first ten rows look like this
logSalary bin
4.2121275979 1
4.2195077052 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.3174881135 2
4.3174881135 2
4.3174881135 2
...
Thanks in advance
EDIT: for now, I am going to go with this solution
DATA bucketed_data;
RETAIN bin bin_limit;
SET baseball;
IF _n_ = 1 THEN DO;
bin_limit = logsalary;
bin = 1;
END;
IF logsalary > bin_limit THEN DO;
bin_limit + &bin_size.;
bin + 1;
END;
RUN;
No need for macro variables put the values into a dataset and combine the dataset with the one you want to bin. Let's use 10 bins instead of 100 to make it easier to examine the results.
First find the minimum and range:
proc means n min max data=sashelp.baseball;
var logsalary;
output out=stats(keep=min range) min=min range=range;
run;
Then use those to bin the data:
DATA bucketed_data;
SET sashelp.baseball (keep=logsalary);
if _n_=1 then set stats;
if not missing(logsalary) then do bin=1 to 10 while(logsalary > min+bin*(range/10));
* nothing to do here ;
end;
run;
Let's use PROC MEANS to see how it worked.
proc means n min max ;
class bin / missing;
var logsalary;
run;
Results: