Create new string variable with partial matching of another


I am using Stata 15 and I would like to create a new string variable based on the contents of another.

Consider the following toy variable:

clear

input str18 string
"a b c"        
"d e f"
"g h i"    
end

I know I can use the regexm() function to flag the observations that contain any of a, c, d or g:

generate new = regexm(string, "a|c|d|g")

list

|string    new |
|--------------|
|  a b c     1 |
|  d e f     1 |
|  g h i     1 |

However, how can I get the following?

|string    new   |
|----------------|
|  a b c     a c |
|  d e f     d   |
|  g h i     g   |

Solution

  • You can use the ustrregexra() function to replace every character that is not one of the wanted ones with a space:

    clear
    
    input str5 string
    "a b c"        
    "d e f"
    "g h i"    
    end
    
    generate wanted = ustrregexra(string, "[^acdg]", " ")
    
    list
    
         +-----------------+
         | string   wanted |
         |-----------------|
      1. |  a b c    a   c |
      2. |  d e f    d     |
      3. |  g h i    g     |
         +-----------------+
    

    If you want to collapse the repeated spaces and trim the leading and trailing ones:

    replace wanted = strtrim(stritrim(wanted))
    
         +-----------------+
         | string   wanted |
         |-----------------|
      1. |  a b c      a c |
      2. |  d e f        d |
      3. |  g h i        g |
         +-----------------+
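
    The character-class approach works here because every token is a single character. If the tokens were whole words, a word-by-word filter may be easier to reason about. The following is only a sketch and not part of the original answer: the names nwords and wanted2 are made up for illustration, it assumes tokens are separated by spaces, and it assumes the wanted set is small enough for inlist(), which accepts only a limited number of string arguments.

    ```stata
    clear

    input str5 string
    "a b c"
    "d e f"
    "g h i"
    end

    * Find the largest number of space-separated tokens in any observation
    generate nwords = wordcount(string)
    summarize nwords, meanonly
    local maxwords = r(max)

    * Walk through the tokens in order and append those in the wanted set
    generate wanted2 = ""
    forvalues i = 1/`maxwords' {
        replace wanted2 = wanted2 + " " + word(string, `i') if inlist(word(string, `i'), "a", "c", "d", "g")
    }

    * Remove the leading space left by the first append
    replace wanted2 = strtrim(wanted2)
    drop nwords

    list
    ```

    On the toy data this should leave wanted2 equal to "a c", "d" and "g", matching the trimmed result above. Because each token is compared as a whole, a word such as "ab" would not be kept merely because it contains an "a".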