I am working with a very large dataset (1 million obs.).
I have a string date that looks like this
key      seq  startdate (string)
AD07     1    August 2011
AD07     2    June 2011
AD07     3    February 2004
AD07     4    November 2004
AD07     5    2001
AD07     6    January 1998
AD5c23   1    January 2014
AD5c235  2    February 2014
AD5c235  3    2014
These are self-reported employment dates.
Some did not report the month at which they started.
But for AD07 I would like to change the date “2001” to “January 2001”. I cannot simply overwrite the value, because I want to keep the original year and only add the month to the string variable.
I started with:
levelsof start if start<="2016", local(levels)
which gives me all the years without the month from 1900 to 2016.
Now I would like to add "January" for the years without the month and keep original years.
How should I do that without writing a separate replace for every year? With a foreach loop?
You have a serious data quality problem if people are claiming to have started work in 1900 and every year since then! Even considering early employment starts and delayed retirement, that implies people older than the oldest established age.
Also, imputing "January" will introduce bias, as almost all the affected job durations will come out longer than they really were. Real January starts will be correct, but no others: "June" or "July" or random months would make more obvious statistical sense.
That said, there is no loop needed here. You're asking for one line, say
replace startdate = "January " + startdate if length(trim(startdate)) == 4
or
replace startdate = "January " + startdate if real(startdate) < .
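assuming a follow-up step that converts to numeric dates. The logic is that all year-only dates trim down to 4 characters or (better) that real() returns a number for a year-only string and missing for any string containing a month name, so real(startdate) < . selects exactly the year-only values.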
Even so, creating a new variable is better practice than overwriting an existing one. Also, consider throwing away the month detail. Is it needed?
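A minimal sketch of the new-variable route, including the follow-up conversion to a numeric monthly date. The names startdate2, mdate and startyear are made up for illustration, and the "MY" mask assumes every value is either a bare year or a month name followed by a year, as in the example data.

```stata
* Sketch only: startdate2, mdate and startyear are hypothetical names.
* Work on a trimmed copy so that the original startdate is untouched.
generate startdate2 = trim(startdate)
replace startdate2 = "January " + startdate2 if real(startdate2) < .

* monthly() with mask "MY" reads strings given as month, then year.
generate mdate = monthly(startdate2, "MY")
format mdate %tm

* If the month detail is not needed, a numeric year may be enough.
generate startyear = real(substr(trim(startdate), -4, 4))
```

Any value that monthly() cannot parse ends up missing in mdate, so count if missing(mdate) is a quick check on the assumption about the input format.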
EDIT
You may have another problem if there are people with two or more jobs in the same year without month specifications. You don't want to impute all months in question as "January". You can check for such observations by
gen byte incomplete = real(startdate) < .
gen year = substr(trim(startdate), -4, 4)
bysort key year incomplete : gen byte multiplebad = incomplete & _N > 1
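Once the flag exists, something like the following would show how many observations are affected and which ones they are. It assumes the variables created in the three lines above, plus key, seq and startdate from the question.

```stata
* Assumes incomplete, year and multiplebad were created as above.
count if multiplebad
list key seq startdate if multiplebad, sepby(key)
```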