Tags: string, split, stata

How to split a string into only two parts (and not discard other parts)


Say that I have these data:

clear all
set obs 2
gen title = "dog - cat - horse" in 1
replace title = "chicken - frog - ladybug" in 2
tempfile data
save `data'

I can split these into three parts:

use `data', clear
split title, p(" - ") 

And I can split them into two parts, discarding the third part:

use `data', clear
split title, p(" - ") limit(2)

Is there an off-the-shelf way to split into only two parts while keeping everything after the first separator (the dash in this case) together in the second variable? In R, I would use tidyr's separate with the extra="merge" option (see tidyr separate only first n instances).

In other words, for the first observation, I would like title1 to be dog and title2 to be cat - horse.

I realize that this is possible using custom code (see Stata split string into parts), but I am hoping for a simple command along the lines of Stata's split/R's separate to accomplish my goal.


Solution

  • This isn't at present an option in the official split command. (Full disclosure: I was the previous author.)

    You could just write your own command. The split2 program below needs more generality and more error checks, but it does what I think you want with your data example. One detail to settle: is trimming spaces desired? The code below trims them.

    clear all
    set obs 2
    gen title = "dog - cat - horse" in 1
    replace title = "chicken - frog - ladybug" in 2
    
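    * One-off version: split at the first "-" and trim surrounding spaces
    gen title1 = trim(substr(title, 1, strpos(title, "-") - 1))
    gen title2 = trim(substr(title, strpos(title, "-") + 1, .))
    
    * The same logic wrapped as a command
    program split2
        syntax varname(string), parse(str) [suffixes(numlist int min=2 max=2)]
        
        * default suffixes are 1 and 2; tokenize stores them in local macros 1 and 2
        if "`suffixes'" == "" local suffixes "1 2"
        tokenize "`suffixes'"
        
        * first part: everything before the first occurrence of the separator
        gen `varlist'`1' = trim(substr(`varlist', 1, strpos(`varlist', "`parse'") - 1))
        * second part: everything after it, later separators included
        gen `varlist'`2' = trim(substr(`varlist', strpos(`varlist', "`parse'") + strlen("`parse'"), .))
    end 
    
    * suffixes(3 4) creates title3 and title4, avoiding a clash with title1 and title2 created above
    split2 title, parse("-") suffixes(3 4)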
    
    list 
        
         +--------------------------------------------------------------------------------+
         |                    title    title1           title2    title3           title4 |
         |--------------------------------------------------------------------------------|
      1. |        dog - cat - horse       dog      cat - horse       dog      cat - horse |
      2. | chicken - frog - ladybug   chicken   frog - ladybug   chicken   frog - ladybug |
         +--------------------------------------------------------------------------------+
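
    As for the error checks mentioned above, here is a minimal sketch of what a slightly more defensive version might look like. The name split2b is hypothetical, chosen only to avoid clashing with split2. It makes two assumptions that are not part of the original program: the command should stop before creating anything if either new variable name is already in use, and a value with no separator should go whole into the first part, leaving the second part empty. strpos() returns 0 when the separator is not found, which is what the cond() calls test for.

    ```stata
    clear all
    set obs 3
    gen title = "dog - cat - horse" in 1
    replace title = "chicken - frog - ladybug" in 2
    replace title = "goldfish" in 3          // extra row with no separator

    * split2b is a hypothetical name, not an official or community command
    program split2b
        syntax varname(string), parse(str) [suffixes(numlist int min=2 max=2)]

        if "`suffixes'" == "" local suffixes "1 2"
        tokenize "`suffixes'"

        * assumption: exit with an error if either new name is already taken
        confirm new variable `varlist'`1' `varlist'`2'

        * position of the first separator; 0 where it does not occur
        tempvar pos
        gen long `pos' = strpos(`varlist', "`parse'")

        * assumption: with no separator, the whole string becomes the first part
        gen `varlist'`1' = cond(`pos' > 0, trim(substr(`varlist', 1, `pos' - 1)), trim(`varlist'))
        gen `varlist'`2' = cond(`pos' > 0, trim(substr(`varlist', `pos' + strlen("`parse'"), .)), "")
    end

    split2b title, parse("-")
    list
    ```

    Whether a value without a separator belongs in the first part or should be flagged is a design choice; the cond() branches are the place to change it.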
    
    

    Note also the egen function ends() and its head and tail options. Using it would need two calls, because it generates just one variable at a time.
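
    For the data example, the two egen calls might look like the sketch below. It assumes the punct() option is used to name the separator and the trim option to remove the surrounding spaces; the variable names title1 and title2 are only chosen to match the question.

    ```stata
    clear all
    set obs 2
    gen title = "dog - cat - horse" in 1
    replace title = "chicken - frog - ladybug" in 2

    * head: whatever precedes the first occurrence of the punct() string
    egen title1 = ends(title), punct("-") trim head

    * tail: whatever follows the first occurrence of the punct() string
    egen title2 = ends(title), punct("-") trim tail

    list
    ```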