Search code examples
variablessplitstatagenerate

Splitting a composite variable into two variables


I have a string variable called country with a value which can be for example Afghanistan2008, but it can also be Brasil2012. I would like to create two new variables, one being the country part and one the year part .

Because there are always numbers at the end of the string, I do know the position the string should be split at from the right side but not from the left side.

Could I use something like:

gen(substr("country",-4,.))

If not, could anyone tell me how to split an entire column of such variables into a country and a year variable? I would also like to keep the original variable.


Solution

  • You can use a regular expression:

    clear
    set obs 2
    
    generate string = ""
    replace string = "Afghanistan2008" in 1
    replace string = "Brasil2012" in 2
    
    generate country = regexs(0) if regex(string, "[a-zA-Z]+")
    generate year = regexs(1) + regexs(2) if regex(string, "(19|20)([0-9][0-9])")
    
    list
    
       +--------------------------------------+
       |          string       country   year |
       |--------------------------------------|
    1. | Afghanistan2008   Afghanistan   2008 |
    2. |      Brasil2012        Brasil   2012 |
       +--------------------------------------+
    

    Type help regex in Stata's command prompt for more information.

    Alternatively you could do the following:

    generate len = length(string) - 3
    
    generate country2 = substr(string, 1, len - 1)
    generate year2 = substr(string, len, .)
    
    list country2 year2
    
       +---------------------+
       |    country2   year2 |
       |---------------------|
    1. | Afghanistan    2008 |
    2. |      Brasil    2012 |
       +---------------------+