Tags: statistics, stata, frequency-analysis

Descriptive statistics in Stata - Word frequencies


I have a large data set whose variables are fileid, year, and about 1000 words (each word is a separate variable). Each row comes from a company report and gives the year, a unique fileid, and the absolute frequency of each word in that report. Now I want some descriptive statistics: the number of words not used at all, the mean of words, the variance of words, and the top percentile of words. How can I program that in Stata?


Solution

  • Caveat: You are probably better off using a text processing package in R or another program. But since no one else has answered, I'll give it a Stata-only shot. There may be an ado file already built that is much better suited, but I'm not aware of one.

    I'm assuming that

    each word is a separate variable

    means that there is a variable such as word_profit that takes an integer value from 0 to some maximum K, where word_profit[i] is the number of times "profit" appears in the i-th report, fileid[i], and that every word variable shares the word_ prefix.

    Mean of words

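    collapse (mean) word_* will give you the average number of times each word is used per report. Adding a by(year) option will give you those means by year. The result is a very wide dataset with a single observation (or one per year), so to make it more manageable you'll want to run the following after the collapse. If you used by(year), use year as the i() variable instead of temp:

    * Constant id so reshape has an i() variable for the single row
    gen temp = 1
    * j(word) stores the part of each variable name after "word_"
    reshape long word_, i(temp) j(word) string
    rename word_ count
    drop temp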
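
    As a rough illustration of the by(year) variant, here is a minimal sketch on toy data. The three reports and the variable word_loss are made up for the example; only the word_ prefix convention comes from the assumption above.

    ```stata
    * Toy data (hypothetical): three reports, two word counters
    clear
    input fileid year word_profit word_loss
    1 2010 4 0
    2 2010 2 0
    3 2011 6 1
    end

    * Average count per word within each year
    collapse (mean) word_*, by(year)

    * One row per year-word pair; year identifies the wide rows,
    * and j(word) holds the suffix after "word_"
    reshape long word_, i(year) j(word) string
    rename word_ mean_count

    list, sepby(year)
    ```

    Because year already identifies each row after the collapse, no temp variable is needed here.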
    
    Variance of words

    collapse (sd) word_* will give you the standard deviation of each word's count across reports (the keyword for collapse is sd, not std). To get variances, just square the standard deviations.
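
    The squaring step could look something like the sketch below. The toy data and word_loss are hypothetical, and the loop simply overwrites each standard deviation with its square.

    ```stata
    * Toy data (hypothetical): three reports, two word counters
    clear
    input fileid year word_profit word_loss
    1 2010 4 0
    2 2010 2 0
    3 2011 6 1
    end

    * (sd) is collapse's keyword for the sample standard deviation
    collapse (sd) word_*

    * Square each column in place to turn standard deviations into variances
    foreach var of varlist word_* {
        replace `var' = `var'^2
    }

    list
    ```

    Note that collapse replaces the data in memory, so wrap it in preserve / restore if you need the original reports afterwards.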

    Number of words not used at all

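    Without a bit more detail, I'm not sure what you want here. One option is to count, for each word, the number of reports in which it does not appear:

    foreach var of varlist word_* {
      * 1 if the word is absent from this report, 0 otherwise
      gen zero_`var' = (`var' == 0)
    }
    * Sum the indicators: number of reports with a zero count, per word
    collapse (sum) zero_*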
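
    If the question instead means "words that never occur in any report", one possible reading is to total each word over all reports and count the totals that are zero. This is only a sketch under that assumption; the toy data and the variables word_loss and word_merger are hypothetical.

    ```stata
    * Toy data (hypothetical): word_merger is never used
    clear
    input fileid year word_profit word_loss word_merger
    1 2010 4 0 0
    2 2010 2 0 0
    3 2011 6 1 0
    end

    * Total occurrences of each word across all reports
    collapse (sum) word_*

    * Go long so that each word is one row
    gen temp = 1
    reshape long word_, i(temp) j(word) string
    rename word_ total
    drop temp

    * Words with a total of zero never appear in any report
    list word if total == 0
    count if total == 0
    ```

    The count is left in r(N), so it can be reused later in the do-file.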