I'd like to create a frequency table in R that takes another variable in consideration as weight.
To be more precise, as an "analytic weight", such as in Stata. According to the help files thereof,
aweights, or analytic weights, are weights that are inversely
proportional to the variance of an observation; i.e., the variance of
the jth observation is assumed to be sigma^2/w_j, where w_j are the
weights. Typically, the observations represent averages and the
weights are the number of elements that gave rise to the average.
For most Stata commands, the recorded scale of aweights is
irrelevant; Stata internally rescales them to sum to N, the number of
observations in your data, when it uses them.
The valuable contribution of a stackflow member was:
Table_WEIGHT <- xtabs(WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
> Prop <- prop.table(Table_WEIGHT)
> Cum <- cumsum(100 * Prop / sum(Prop))
> Cum
1 2 3 4 5 6 7
14.35397 29.14973 43.23935 57.31355 71.50782 85.80359 100.00000
> out <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
+ Prop = as.numeric(Prop), Cum = as.numeric(Cum))
> out
INTERVIEW_DAY Freq Prop Cum
1 1 11803438268 0.1435397 14.35397
2 2 12166729888 0.1479576 29.14973
3 3 11586059070 0.1408962 43.23935
4 4 11573379591 0.1407420 57.31355
5 5 11672116808 0.1419427 71.50782
6 6 11755579310 0.1429577 85.80359
7 7 11673877965 0.1419641 100.00000
Nonetheless, the frequency is still not what I'm expecting, because we are using the sum of the second variable as weight, instead of the "analytic weight" as disposed above.
The desired table should be:
(mean) |
interview_d |
ay | Freq. Percent Cum.
------------+-----------------------------------
1 | 2,974.1424 14.35 14.35
2 | 3,065.6819 14.80 29.15
3 | 2,919.3688 14.09 43.24
4 |2,916.17392 14.07 57.31
5 |2,941.05299 14.19 71.51
6 | 2,962.0832 14.30 85.80
7 | 2,941.4968 14.20 100.00
------------+-----------------------------------
Total | 20,720 100.00
Note that the "Freq"s are quite different.
Here an example of the two variables (INTERVIEW_DATE) and WEIGHT(WEIGHT), which are the date of survey and a weight that was not specified in the original article.
> timeuse_2003$INTERVIEW_DATE[1:15]
[1] "2003-01-03" "2003-01-04" "2003-01-04" "2003-01-02" "2003-01-09" "2003-01-02" "2003-01-06"
[8] "2003-01-07" "2003-01-04" "2003-01-09" "2003-01-04" "2003-01-05" "2003-01-04" "2003-01-01"
[15] "2003-01-04"
> timeuse_2003$WEIGHT[1:15]
[1] 8155462.7 1735322.5 3830527.5 6622023.0 3068387.3 3455424.9 1637826.3 6574426.8 1528296.3
[10] 4277052.8 1961482.3 505227.2 2135476.8 5366309.3 1058351.1
I'll be thankful for any contribution.
I found a non-elegant solution according to Stata help files. I just added up the line
timeuse_2003$N_WEIGHT <- timeuse_2003$WEIGHT * 20720/ sum(timeuse_2003$WEIGHT)
and kept the code with
Table_WEIGHT <- xtabs(N_WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
Prop <- prop.table(Table_WEIGHT)
Cum <- cumsum(100 * Prop / sum(Prop))
Cum
Freq_Table <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
Prop = as.numeric(Prop), Cum = as.numeric(Cum))
Freq_Table
The table was then correct such as:
> Freq_Table
INTERVIEW_DAY Freq Prop Cum
1 1 2974.1424 0.14353969 14.353969
2 2 3065.6819 0.14795762 29.149731
3 3 2919.3688 0.14089618 43.239349
4 4 2916.1739 0.14074198 57.313547
5 5 2941.0530 0.14194271 71.507819
6 6 2962.0832 0.14295769 85.803587
7 7 2941.4968 0.14196413 100.000000
If someone could clarify how to substitute the number of observations I put in manually for something automatic (this code will be used in different datasets, so I can't update every single one, switching the number of observations everytime. Something like ".N" would be very fine!
Thank you!