I am trying to compute inter-survey estimates of population numbers between years 1981 and 1986 for each YEAR (1982...1985) and each AGE group (10-14,..., 55-59). The complicating factor in my dataset is that I have 52 provinces and around 600 ZONA91OK (districts) and each PROVINCE has a different number of districts.
The formula I would like to apply in order to obtain vectors containing the information for each missing YEAR, NATIONALITY, PROVINCE and for each district (ZONA91OK) is the following:
ex. for 1982
value for 1982, age group 10-14: x(10,1982)=[(x(10,1981)-x(15,1986))/5]-x(10,1982)
x(15,1982)=[(x(15,1981)-x(20,1986))/5]-x(15,1982)
x(20,1982)=[(x(20,1981)-x(25,1986))/5]-x(20,1982)
...
x(55,1982)=[(x(55,1981)-x(55,1986))/5]-x(55,1982) -exception-
Any help on the issue is very much appreciated!
Here is the reproducible sample (a subset of the whole database as it is very big)
mydata<-structure(list(YEAR = c(1981, 1981, 1981, 1981, 1981, 1981, 1981,
1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981,
1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981,
1981, 1981, 1981, 1981, 1981, 1981, 1981, 1986, 1986, 1986, 1986,
1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986,
1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986,
1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986),
PROVINCE = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
ZONA91OK = c(101, 101, 101, 101, 101, 101, 101, 101, 101,
102, 102, 102, 102, 102, 102, 102, 102, 102, 1036, 1036,
1036, 1036, 1036, 1036, 1036, 1036, 1036, 1059, 1059, 1059,
1059, 1059, 1059, 1059, 1059, 1059, 101, 101, 101, 101, 101,
101, 101, 101, 101, 102, 102, 102, 102, 102, 102, 102, 102,
102, 1036, 1036, 1036, 1036, 1036, 1036, 1036, 1036, 1036,
1059, 1059, 1059, 1059, 1059, 1059, 1059, 1059, 1059), AGE5 = c(10,
15, 20, 25, 30, 35, 40, 45, 50, 10, 15, 20, 25, 30, 35, 40,
45, 50, 10, 15, 20, 25, 30, 35, 40, 45, 50, 10, 15, 20, 25,
30, 35, 40, 45, 50, 10, 15, 20, 25, 30, 35, 40, 45, 50, 10,
15, 20, 25, 30, 35, 40, 45, 50, 10, 15, 20, 25, 30, 35, 40,
45, 50, 10, 15, 20, 25, 30, 35, 40, 45, 50), NATIONALITY = structure(c(9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L), .Label = c("España",
"UE-15 y PD", "Resto Europa", "Magreb", "África Sub-sahariana",
"Latinoamérica", "Asia", "Resto del Mundo", "No computable"
), class = "factor"), FREQUENCY = c(993.8141, 994.907, 894.0322,
845.8348, 659.6786, 577.2588, 540.6329, 684.9917, 673.6348,
910.511, 1068.9258, 936.9949, 763.547, 643.4404, 572.72,
536.6591, 665.975, 768.6866, 967.694100000002, 980.340100000001,
811.637500000001, 746.058500000001, 769.820600000001, 722.398000000001,
730.371600000001, 690.084600000001, 501.9178, 8243.04149999997,
7785.02419999994, 7505.78429999991, 7464.74579999992, 7663.47079999997,
6700.90559999997, 5203.31959999996, 5582.66059999997, 4837.30459999996,
869.1754, 982.7461, 945.5031, 904.2817, 813.7127, 663.955,
577.2896, 544.1257, 689.9815, 780.3824, 879.7538, 1025.5724,
882.475, 716.0049, 627.3571, 579.4372, 525.4546, 679.9666,
1035.6544, 952.521599999999, 962.537599999999, 832.3296,
733.1696, 726.1568, 704.1248, 700.1136, 667.0624, 9023.05139999993,
8285.31719999994, 8080.95919999994, 8175.28479999993, 7786.53429999994,
7796.56439999994, 6842.11639999996, 5239.83509999998, 5616.95939999997
)), .Names = c("YEAR", "PROVINCE", "ZONA91OK", "AGE5", "NATIONALITY",
"FREQUENCY"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L,
36L, 8173L, 8174L, 8175L, 8176L, 8177L, 8178L, 8179L, 8180L,
8181L, 8182L, 8183L, 8184L, 8185L, 8186L, 8187L, 8188L, 8189L,
8190L, 8191L, 8192L, 8193L, 8194L, 8195L, 8196L, 8197L, 8198L,
8199L, 8200L, 8201L, 8202L, 8203L, 8204L, 8205L, 8206L, 8207L,
8208L), class = "data.frame")
If you can guarantee that there is no missing data -- that is, every year contains exactly the same number of rows, and the same combinations of province, age, nationality, and zone, then here is a simple solution:
df<-mydata[with(mydata,order(YEAR,NATIONALITY,PROVINCE,ZONA91OK,AGE5)),]
splitdata<-split(df,df$YEAR)
for(i in 1982:1985){
chi<-as.character(i)
splitdata[[chi]]<-splitdata$`1981`
splitdata[[chi]]$YEAR<-i
splitdata[[chi]]$FREQUENCY<-splitdata$`1986`$FREQUENCY*(i-1981)/5+
splitdata$`1981`$FREQUENCY*(1986-i)/5
}
newdata<-do.call(rbind,splitdata)
newdata
On edit: The ddply
method posted by @shadow is a more "correct" R way to do this, and using ddply
will make it easier to deal with missing data. If you don't have missing data, though, then the split method seems to be more efficient:
Unit: milliseconds
expr min lq median uq max neval
splitmethod 2.573737 2.638159 2.676954 2.735601 235.2619 100
ddplymethod 9.812680 9.989192 10.148241 17.128667 243.7309 100