Search code examples
interpolationpanelstata

Expanding annual to semiannual data in panel data


I have a panel dataset with the share of young people in a country(variable is named value) for 30 OECD countries for the years 1960-2050.

The data was only available at an annual frequency, but I want it at a semiannual frequency, so I want to expand and interpolate the data. The goal is to have the value for each country for every half year.

I have expanded the dataset and assigned the duplicates dupe=1 to keep track. Then I have generated a variable named year_half taking the value 1 for the originals (first half of the year) and the value 2 for the duplicates (second half of the year). Then I interpolate. The code is:

expand 2, gen(dupe)

gen year_half = 1 if dupe==0
replace year_half = 2 if dupe==1

bysort year : gen hdate = yh(year, year_half)
by year : replace value = . if dupe==1

ipolate value hdate, gen(linear) epolate

I don't get any error codes, but I don't get the right outcome either. The variable linear only has interpolated values for some of the observations (the rest are missing), and it does not seem to be corresponding to year_half, hdate or dupe. Does anyone know how to fix this?


Solution

  • This seems to be the essence of what you are doing, except that the panel structure requires a repetition by country:.

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(year value)
    2010 2
    2011 4
    2012 6
    end
    
    expand 2
    bysort year : replace value = . if _n == 2
    by year : gen halfyear = yh(year, _n)
    
    ipolate value halfyear , gen(value2) epolate
    
    list, sep(0)
    
         +----------------------------------+
         | year   value   halfyear   value2 |
         |----------------------------------|
      1. | 2010       2        100        2 |
      2. | 2010       .        101        3 |
      3. | 2011       4        102        4 |
      4. | 2011       .        103        5 |
      5. | 2012       6        104        6 |
      6. | 2012       .        105        7 |
         +----------------------------------+
    

    However, your method is arbitrary. You take the yearly value as applying in the first half of the year, and would get a different answer if you took it that it applies to the second half of the year.

    Here's a more symmetrical procedure:

     clear
     input float(year value)
     2010 2
     2011 4
     2012 6
     end
        
     expand 2 
        
     sort year 
     gen VALUE = (value[_n-1] + 2 * value + value[_n+1])/4 
     by year : gen halfyear = yh(year, _n)
     ipolate VALUE halfyear, gen(value2) epolate 
        
     list, sep(0) 
     
          +------------------------------------------+
         | year   value   VALUE   halfyear   value2 |
         |------------------------------------------|
      1. | 2010       2       .        100      1.5 |
      2. | 2010       2     2.5        101      2.5 |
      3. | 2011       4     3.5        102      3.5 |
      4. | 2011       4     4.5        103      4.5 |
      5. | 2012       6     5.5        104      5.5 |
      6. | 2012       6       .        105      6.5 |
         +------------------------------------------+
    

    EDIT Tentative generalisation to panel data (not tested)

      expand 2 
      sort country year 
      by country: gen VALUE = (value[_n-1] + 2 * value + value[_n+1])/4 
      by country year : gen halfyear = yh(year, _n)
      by country: ipolate VALUE halfyear, gen(value2) epolate