Search code examples
rfrequency

Frequency table with second variable as "analytic weight" in R


I'd like to create a frequency table in R that takes another variable in consideration as weight.

To be more precise, as an "analytic weight", such as in Stata. According to the help files thereof,

aweights, or analytic weights, are weights that are inversely
        proportional to the variance of an observation; i.e., the variance of
        the jth observation is assumed to be sigma^2/w_j, where w_j are the
        weights.  Typically, the observations represent averages and the
        weights are the number of elements that gave rise to the average.
        For most Stata commands, the recorded scale of aweights is
        irrelevant; Stata internally rescales them to sum to N, the number of
        observations in your data, when it uses them.

The valuable contribution of a stackflow member was:

Table_WEIGHT <- xtabs(WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
> Prop <- prop.table(Table_WEIGHT)
> Cum <- cumsum(100 * Prop / sum(Prop))
> Cum
        1         2         3         4         5         6         7 
 14.35397  29.14973  43.23935  57.31355  71.50782  85.80359 100.00000 
> out <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
+                   Prop = as.numeric(Prop), Cum = as.numeric(Cum))
> out
  INTERVIEW_DAY        Freq      Prop       Cum
1             1 11803438268 0.1435397  14.35397
2             2 12166729888 0.1479576  29.14973
3             3 11586059070 0.1408962  43.23935
4             4 11573379591 0.1407420  57.31355
5             5 11672116808 0.1419427  71.50782
6             6 11755579310 0.1429577  85.80359
7             7 11673877965 0.1419641 100.00000

Nonetheless, the frequency is still not what I'm expecting, because we are using the sum of the second variable as weight, instead of the "analytic weight" as disposed above.

The desired table should be:

 (mean) |
interview_d |
         ay |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 | 2,974.1424       14.35       14.35
          2 | 3,065.6819       14.80       29.15
          3 | 2,919.3688       14.09       43.24
          4 |2,916.17392       14.07       57.31
          5 |2,941.05299       14.19       71.51
          6 | 2,962.0832       14.30       85.80
          7 | 2,941.4968       14.20      100.00
------------+-----------------------------------
      Total |     20,720      100.00

Note that the "Freq"s are quite different.

Here an example of the two variables (INTERVIEW_DATE) and WEIGHT(WEIGHT), which are the date of survey and a weight that was not specified in the original article.

> timeuse_2003$INTERVIEW_DATE[1:15]
 [1] "2003-01-03" "2003-01-04" "2003-01-04" "2003-01-02" "2003-01-09" "2003-01-02" "2003-01-06"
 [8] "2003-01-07" "2003-01-04" "2003-01-09" "2003-01-04" "2003-01-05" "2003-01-04" "2003-01-01"
[15] "2003-01-04"


> timeuse_2003$WEIGHT[1:15]
 [1] 8155462.7 1735322.5 3830527.5 6622023.0 3068387.3 3455424.9 1637826.3 6574426.8 1528296.3
[10] 4277052.8 1961482.3  505227.2 2135476.8 5366309.3 1058351.1

I'll be thankful for any contribution.


Solution

  • I found a non-elegant solution according to Stata help files. I just added up the line

    timeuse_2003$N_WEIGHT <- timeuse_2003$WEIGHT * 20720/ sum(timeuse_2003$WEIGHT)
    

    and kept the code with

    Table_WEIGHT <- xtabs(N_WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
    Prop <- prop.table(Table_WEIGHT)
    Cum <- cumsum(100 * Prop / sum(Prop))
    Cum
    Freq_Table <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
                      Prop = as.numeric(Prop), Cum = as.numeric(Cum))
    Freq_Table
    

    The table was then correct such as:

    > Freq_Table
      INTERVIEW_DAY      Freq       Prop        Cum
    1             1 2974.1424 0.14353969  14.353969
    2             2 3065.6819 0.14795762  29.149731
    3             3 2919.3688 0.14089618  43.239349
    4             4 2916.1739 0.14074198  57.313547
    5             5 2941.0530 0.14194271  71.507819
    6             6 2962.0832 0.14295769  85.803587
    7             7 2941.4968 0.14196413 100.000000
    

    If someone could clarify how to substitute the number of observations I put in manually for something automatic (this code will be used in different datasets, so I can't update every single one, switching the number of observations everytime. Something like ".N" would be very fine!

    Thank you!