Search code examples
sqlfunctionsasdatasetsas-macro

Split description into multiple rows


I'm trying to accomplish the following in SAS.

My dataset looks like this:

|ID  |Description
|----|-----------------
| 1  |Object Car_bmw Processed.Colour changed from Red to Green Mileage changed from 30 to 32 Object Car_Audi Processed.Colour changed from Blue to White Mileage changed from 0 to 5
| 2  |Object Car_Kia Processed. Colour changed from White to Black Mileage changed from 3 to 9 Value changed from 12034 to 11029
| 3  |Object Phone_Iphone Processed. Colour changed from Black to Green Value changed from 300 to 290 Object Car_bmw Processed. Colour changed from White to Red Mileage changed from 100 to 50

I would like to create a new object column which will split the "description" column into so:

|ID  |Index| Description
|----|-----|-----------
| 1  | 1   | Object Car_bmw Processed.Colour changed from Red to Green Mileage changed from 30 to 32
|  1 | 2   | Object Car_Audi Processed.Colour changed from Blue to White Mileage changed from 0 to 5                                
| 2  | 1   | Object Car_Kia Processed.Colour changed from White to Black Mileage changed from 3 to 9 Value changed from 12034 to 11029
| 3  | 1   | Object Phone_Iphone Processed.Colour changed from Black to Green Value changed from 300 to 290 
| 3  | 2   | Object Car_bmw Processed. Colour changed from White to Red Mileage changed from 100 to 50

I tried using the following functions to separate the objects and "descriptions" into different rows. I used "Processed." as a delimiter since that is common between the different objects. However, it didn't work

data want;
set have;
do index = 1 to countw(LOG_LONG_DESC,'Processed.');
line_part = dequote(scan(LOG_LONG_DESC,index,'Processed.'));
OUTPUT;
end;
run;

Solution

  • Assuming there is some delimiter between the lines in original photograph then I would recommend splitting it into one observation per line and making a second variable that increments when the word PROCESSED appears in the line.

    Let's make a sample dataset that is using CR as the delimiter between the lines.

    data have;
      length id 8 description $400;
      id=1;
      description='Object Car_bmw Processed.' || '0D'x
                ||'Colour changed from Red to Green' || '0D'x 
                ||'Mileage changed from 30 to 32' || '0D'x
                ||'Object Car_Audi Processed.' || '0D'x
                ||'Colour changed from Blue to White' || '0D'x 
                ||'Mileage changed from 0 to 5'
      ;
      output;
    run;
    

    Now we can use COUNTW() and SCAN() to split it into lines and FINDW() to detect when a line has PROCESSED in it.

    data want;
      set have;
      group=0;
      do lineno=1 to countw(description,'0D'x);
        length line $100;
        line=scan(description,lineno,'0D'x);
        if findw(line,'processed',,'spit') then group+1;
        output;
      end;
      drop description;
    run;
    

    Result

    Obs    id    group    lineno                  line
    
     1      1      1         1      Object Car_bmw Processed.
     2      1      1         2      Colour changed from Red to Green
     3      1      1         3      Mileage changed from 30 to 32
     4      1      2         4      Object Car_Audi Processed.
     5      1      2         5      Colour changed from Blue to White
     6      1      2         6      Mileage changed from 0 to 5