Stata: How to modify some values in a string variable but keep original values?


I am working with a very large dataset (1 million obs.).

I have a string date that looks like this

key seq startdate (string)  
AD07    1   August 2011 
AD07    2   June 2011   
AD07    3   February 2004   
AD07    4   November 2004   
AD07    5   2001    
AD07    6   January 1998    
AD5c23  1   January 2014    
AD5c235 2   February 2014   
AD5c235 3   2014    

These are self-reported employment dates.

Some respondents did not report the month in which they started. For those cases I would like to add a month: for AD07, for example, the date “2001” should become “January 2001”. I cannot simply overwrite the value with a fixed string, because I want to keep the original year and only add the month to the string variable.

I started with:

levelsof start if start<="2016", local(levels)

which gives me all the years without the month from 1900 to 2016.

Now I would like to add "January" for the years without the month and keep original years.

How should I do that without writing a separate replace for every year? With a foreach loop?


Solution

  • You have a serious data quality problem if people are claiming to have started work in 1900 and every year since then! Even considering early employment starts and delayed retirement, that implies people older than the oldest established age.

    Also, imputing "January" will impart bias as almost all job durations will be longer than they would have been. Real January starts will be correct, but no others: "June" or "July" or random months would make more obvious statistical sense.
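    As a rough sketch of the mid-year alternative, the year-only entries could be given July directly in a numeric monthly date, leaving the original string untouched. The variable name mdate_mid is hypothetical, and the choice of month 7 is only an illustration of the point above, not a recommendation.

    ```stata
    * Hypothetical new variable mdate_mid; startdate itself is left unchanged.
    * monthly() returns missing for year-only strings such as "2001".
    generate mdate_mid = monthly(trim(startdate), "MY")

    * Fill year-only entries with July (month 7) of the reported year.
    * real() is non-missing only when the whole string is a number.
    replace mdate_mid = ym(real(trim(startdate)), 7) if real(trim(startdate)) < .

    format mdate_mid %tm
    ```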

    That said, there is no loop needed here. You're asking for one line, say

    replace startdate = "January " + startdate if length(trim(startdate)) == 4
    

    or

    replace startdate = "January " + startdate if real(startdate) < . 
    

    Both assume a follow-up step converting to numeric dates. The logic is that all year-only dates trim down to 4 characters, or (better) that feeding a string containing a month name to real() yields missing, so only the year-only strings return a non-missing number.

    That said in turn, creating a new variable is better practice than over-writing one. Also, consider throwing away the month detail. Is it needed?
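    A minimal sketch of the new-variable approach, assuming the string variable is called startdate as in the question. The names startdate_imp, imputed and mdate are hypothetical and chosen only for illustration.

    ```stata
    * Work on a copy so the self-reported values stay intact.
    generate startdate_imp = trim(startdate)

    * Flag year-only entries before changing them, so imputed values stay identifiable.
    generate byte imputed = real(startdate_imp) < .

    * Prefix a month only where the whole string is a number (a bare year).
    replace startdate_imp = "January " + startdate_imp if imputed

    * Convert to a numeric monthly date for later calculations.
    generate mdate = monthly(startdate_imp, "MY")
    format mdate %tm
    ```

    Keeping the imputed flag makes it easy to rerun analyses with and without the imputed observations.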

    EDIT

    You may have another problem if there are people with two or more jobs in the same year without month specifications. You don't want to impute all months in question as "January". You can check for such observations by

    gen byte incomplete = real(startdate) < . 
    gen year = substr(trim(startdate), -4, 4) 
    bysort key year incomplete : gen byte multiplebad = incomplete & _N > 1
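
    Once those three variables exist, the flagged cases could be inspected along these lines. This is only a sketch and uses the variable names from the commands above.

    ```stata
    * How many observations share a key and year with another year-only entry?
    count if multiplebad

    * Show the affected records grouped by person.
    list key seq startdate if multiplebad, sepby(key)
    ```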