I have a string
variable called country
with a value which can be for example Afghanistan2008
, but it can also be Brasil2012
. I would like to create two new variables, one being the country part and one the year part .
Because there are always numbers at the end of the string
, I do know the position the string
should be split at from the right side but not from the left side.
Could I use something like:
gen(substr("country",-4,.))
If not, could anyone tell me how to split an entire column of such variables into a country
and a year
variable? I would also like to keep the original variable.
You can use a regular expression:
clear
set obs 2
generate string = ""
replace string = "Afghanistan2008" in 1
replace string = "Brasil2012" in 2
generate country = regexs(0) if regex(string, "[a-zA-Z]+")
generate year = regexs(1) + regexs(2) if regex(string, "(19|20)([0-9][0-9])")
list
+--------------------------------------+
| string country year |
|--------------------------------------|
1. | Afghanistan2008 Afghanistan 2008 |
2. | Brasil2012 Brasil 2012 |
+--------------------------------------+
Type help regex
in Stata's command prompt for more information.
Alternatively you could do the following:
generate len = length(string) - 3
generate country2 = substr(string, 1, len - 1)
generate year2 = substr(string, len, .)
list country2 year2
+---------------------+
| country2 year2 |
|---------------------|
1. | Afghanistan 2008 |
2. | Brasil 2012 |
+---------------------+