Search code examples
sortingconcatenationstata

Concatenate and sort at the same time


I have a rather simple task which I am struggling with in Stata.

I have three variables SicTwo1 SicTwo2 SicTwo3, which are numeric (e.g. "12", "25", and "16")

I now want to concatenate them into a new variable SicAndSicAndSic, BUT they shall be ordered from lowest to highest value (e.g. "121625"), ideally with a separator (e.g. "12&16&25")

I tried this code:

gen NumSics = 0 
        replace NumSics = NumSics + 1 if !missing(SicTwo1)
        replace NumSics = NumSics + 1 if !missing(SicTwo2) & SicTwo1 != SicTwo2
        replace NumSics = NumSics + 1 if !missing(SicTwo3) & SicTwo1 != SicTwo2 & SicTwo1 != SicTwo3 & SicTwo2 != SicTwo3

        sort SicTwo1 SicTwo2 SicTwo3

        gen SicAndSic1 = string(SicTwo1)
        gen SicAndSic2 = string(SicTwo1) + "&" + string(SicTwo2)
        gen SicAndSic3 = string(SicTwo1) + "&" + string(SicTwo2) + "&" + string(SicTwo3)

        gen SicAndSic = ""
        replace SicAndSic = SicAndSic1 if NumSics == 1
        replace SicAndSic = SicAndSic2 if NumSics == 2
        replace SicAndSic = SicAndSic3 if NumSics == 3

But it does not sort the variables, and just puts them next to each other.


Solution

  • See https://www.stata-journal.com/article.html?article=pr0046 for one way to sort variables within observations (rows).

    clear 
    input SicTwo1 SicTwo2 SicTwo3 
    12 16 25 
    23 12 11 
    99 88 11 
    end 
    
    rowsort SicTwo?, gen(S1 S2 S3)
    
    egen wanted = concat(S?) , p(&)
    
    list 
    
         +-------------------------------------------------------+
         | SicTwo1   SicTwo2   SicTwo3   S1   S2   S3     wanted |
         |-------------------------------------------------------|
      1. |      12        16        25   12   16   25   12&16&25 |
      2. |      23        12        11   11   12   23   11&12&23 |
      3. |      99        88        11   11   88   99   11&88&99 |
         +-------------------------------------------------------+
    
    

    Your code shows a misunderstanding of sort, which sorts observations by values of variables, but emphatically does not sort within observations -- which is precisely what does rowsort does, with the proviso that the original variables are unchanged, and the results go in new variables.

    Your variables are stated to be numeric, so any missing values will by default be sorted to high. If you want something else, you need to spell out what that is.