Is there a simple way to extract the first N words from a local macro that is comma-separated or comma-and-space-separated in Stata?


Given a local macro that contains a string of levels separated by a comma (","), a comma and a space (", "), or only a space (" "), is there a simple way to extract the first N levels (or words) of this local macro?

The string would look like "12, 123, 1321, 41", or "12,123,1321,41" or "12 123 1321 41".

Basically, I would be happy with a version of the macro function word # of string that works more or less like word 1/N of string. (See "Macro functions for parsing" on p. 12 of "Macro definition and manipulation".)

For more context, I am working with the output of levelsof with its local() and sep() options, so I can choose whichever separator is easiest to work with. I want to pass the resulting levels as arguments to the inlist() function. The following usually works, but inlist() only takes up to 250 arguments. That is why I would like to extract chunks of 250 words from the results of levelsof.

sysuse auto, clear
levelsof mpg if trunk > 20, local(levels) sep(", ")
list if inlist(mpg, `levels')

"solution" so far

I have figured out a non-simple way of achieving that, but it does not look good, and I am wondering whether there is a simple, built-in way of doing the same thing.

sysuse auto, clear

levelsof mpg if trunk > 20, local(levels) sep(", ")
scalar number_of_words = 3
forvalues i = 1 (1) `=number_of_words' {
        local word_i = `i'
        local this_level : word `word_i' of `levels'
        local list_of_levels = "`list_of_levels'`this_level'" 
        
        di as text "loop: `i'"
        di as text "this level: `this_level'"
        di as text "list of levels so far: `list_of_levels'"
    }

di "`list_of_levels'"

// trim trailing comma
local trimmed_list_of_levels = substr( "`list_of_levels'" , 1 , strlen( "`list_of_levels'" )-1) 

di "`trimmed_list_of_levels'"
list make mpg price trunk if inlist(mpg, `trimmed_list_of_levels')

output

. sysuse auto, clear
(1978 Automobile Data)

. 
. levelsof mpg if trunk > 20, local(levels) sep(", ")
12, 15, 17, 18

. scalar number_of_words = 3

. forvalues i = 1 (1) `=number_of_words' {
  2.         local word_i = `i'
  3.         local this_level : word `word_i' of `levels'
  4.         local list_of_levels = "`list_of_levels'`this_level'" 
  5.         
.         di as text "loop: `i'"
  6.         di as text "this level: `this_level'"
  7.         di as text "list of levels so far: `list_of_levels'"
  8.     }
loop: 1
this level: 12,
list of levels so far: 12,
loop: 2
this level: 15,
list of levels so far: 12,15,
loop: 3
this level: 17,
list of levels so far: 12,15,17,

. 
. di "`list_of_levels'"
12,15,17,

. 
. // trim trailing comma
. local trimmed_list_of_levels = substr( "`list_of_levels'" , 1 , strlen( "`list_of_levels'" )-1) 

. 
. di "`trimmed_list_of_levels'"
12,15,17

. list make mpg price trunk if inlist(mpg, `trimmed_list_of_levels')

     +------------------------------------------+
     | make                mpg    price   trunk |
     |------------------------------------------|
  2. | AMC Pacer            17    4,749      11 |
  5. | Buick Electra        15    7,827      20 |
 23. | Dodge St. Regis      17    6,342      21 |
 26. | Linc. Continental    12   11,497      22 |
 27. | Linc. Mark V         12   13,594      18 |
     |------------------------------------------|
 31. | Merc. Marquis        15    6,165      23 |
 53. | Audi 5000            17    9,690      15 |
 74. | Volvo 260            17   11,995      14 |
     +------------------------------------------+

Edits in response to comments

edit 01)

The following, for example, does not work. It fails with error r(130), "expression too long".

clear 

set obs 1000
gen id = _n 
gen x1 = rnormal()

sum * 
levelsof id if x1>0, local(levels) sep(", ")
sum * if inlist(id, `levels')

example where this construction (levelsof + inlist) seems to be necessary

clear 

set obs 5000
gen id = round(_n/5)
gen x1 = rnormal()

sum * 
levelsof id if x1>2, local(levels) sep(", ")
sum * if x1>2 // if threshold is small enough, there will be too many values for inlist()
sum * if inlist(id, `levels')

Solution

  • Using your additional example as a basis, you could use egen with max() to create a flag that equals 1 for every observation of an id that has at least one case where x1 is above a given threshold. For example:

    clear 
    set seed 2021
    set obs 5000
    gen id = round(_n/5)
    gen x1 = rnormal()
    
    sum * 
    levelsof id if x1>2, local(levels) sep(", ")
    sum * if x1>2 // if threshold is small enough, there will be too many values for inlist()
    sum * if inlist(id, `levels')
    
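    // This selects the same observations without inlist():
    // flag each row over the threshold, then spread the flag to the whole id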
    gen over_threshold = x1>2 
    egen id_over_thresh = max(over_threshold), by(id)
    
    sum * if id_over_thresh
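
If you still want to go through levelsof and inlist(), the chunking asked about in the question could be sketched roughly as below. This is only an illustration, not a built-in feature: the names chunk_size, chunk, k, in_levels and any_over are made up for this example, and the chunk size of 100 is an arbitrary value chosen to stay below the inlist() argument limit mentioned in the question. It uses the default space-separated output of levelsof and adds the commas itself.

```stata
clear
set seed 2021
set obs 5000
gen id = round(_n/5)
gen x1 = rnormal()

// Default space-separated output is the easiest form to loop over
levelsof id if x1 > 2, local(levels)

// chunk_size is an arbitrary illustrative value, kept below the inlist() limit
local chunk_size 100
gen byte in_levels = 0
local chunk
local k = 0

foreach v of local levels {
    // Each value is stored with a leading comma: ", 3, 17, ..."
    local chunk "`chunk', `v'"
    local ++k
    if `k' == `chunk_size' {
        // Expands to inlist(id, 3, 17, ...)
        replace in_levels = 1 if inlist(id`chunk')
        local chunk
        local k = 0
    }
}

// Flush the last, possibly shorter, chunk
if `k' > 0 {
    replace in_levels = 1 if inlist(id`chunk')
}

// Sanity check: the chunked flag should match the by-group flag
egen byte any_over = max(x1 > 2), by(id)
assert in_levels == any_over

sum * if in_levels
```

The flag is accumulated with replace across chunks, so no single inlist() call ever sees more than chunk_size values. The egen approach above is still shorter and avoids the macro handling entirely.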