Descriptive statistics in Stata - Word frequencies

I have a large data set containing the variables fileid, year and about 1000 words (each word is a separate variable). Each row comes from a company report and gives the year, a unique fileid and the absolute frequency of each word in that report. Now I want some descriptive statistics: number of words not used at all, mean of words, variance of words, top percentile of words. How can I program that in Stata?

Solution

Caveat: You are probably better off using a text processing package in R or another program. But since no one else has answered, I'll give it a Stata-only shot. There may be an ado file already built that is much better suited, but I'm not aware of one.

I'm assuming that "each word is a separate variable" means that there is a variable word_profit that takes a value k from 0 to K, where word_profit[i] is the number of times "profit" is written in the i-th report, fileid[i].

Mean of words

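collapse (mean) word_* will give you the average number of times each word is used. Adding a by(year) option will give you those means by year. Note that collapse replaces the data in memory, so wrap it in preserve and restore if you still need the report-level data afterwards. To make the result more manageable than a very wide one-observation dataset, run the following after the collapse:

gen temp = 1
* word_profit, word_loss, ... become rows of word_; the word itself is stored in the string variable word
reshape long word_, i(temp) j(word) string
rename word_ count
drop temp

If you collapsed with by(year), skip temp and use i(year) in the reshape instead, since temp would no longer identify a single row.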
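For the by(year) case, the steps above can be sketched as follows. The toy data, the word names and the variable name mean_count are made up for illustration; only the word_ prefix follows the assumption stated earlier.

```stata
* Hypothetical toy data in the assumed layout: one row per report,
* one word_* variable per word (names and values are made up)
clear
input fileid year word_profit word_loss word_risk
1 2010 3 0 0
2 2010 5 1 0
3 2011 0 2 0
4 2011 4 1 0
end

* Yearly means, then one row per year-word pair
collapse (mean) word_*, by(year)
reshape long word_, i(year) j(word) string
rename word_ mean_count
list, sepby(year)
```

Here year already identifies each row after the collapse, so no temp variable is needed as the i() identifier.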

Variance of words

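collapse (sd) word_* will give you the standard deviation of each word's count across reports. Run it on the original report-level data, not on the already collapsed means. To get variances, just square the standard deviations.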
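As a minimal sketch of the variance step: collapse names this statistic sd, and squaring each result gives the variance. The toy data below is hypothetical and only mirrors the assumed word_* layout.

```stata
* Hypothetical toy data in the assumed layout (names and values are made up)
clear
input fileid year word_profit word_loss word_risk
1 2010 3 0 0
2 2010 5 1 0
3 2011 0 2 0
4 2011 4 1 0
end

* Standard deviation of each word across reports ...
collapse (sd) word_*

* ... squared in place to turn it into a variance
foreach var of varlist word_* {
    replace `var' = `var'^2
}
list
```

For a single word, summarize word_profit followed by display r(Var) should give the same number without collapsing the data.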

Number of words not used at all

Without a bit more clarity, I'm not sure what you want here. You could count, for each word, the number of reports in which it does not appear:

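* Flag reports in which the word does not occur (1 if the count is zero, 0 otherwise)
* Note: the zero_ prefix adds 5 characters, and Stata variable names are limited to 32
foreach var of varlist word_* {
  gen zero_`var' = (`var' == 0)
}
* Summing the flags gives, per word, the number of reports with a zero count
collapse (sum) zero_*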
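If "not used at all" instead means words that never occur in any report, one possible reading is to total each word across all reports and count the zero totals. This is only a sketch under that assumption; the toy data and the variable name total are made up for illustration.

```stata
* Hypothetical toy data in the assumed layout (names and values are made up)
clear
input fileid year word_profit word_loss word_risk
1 2010 3 0 0
2 2010 5 1 0
3 2011 0 2 0
4 2011 4 1 0
end

* Total occurrences of each word across all reports
collapse (sum) word_*

* One row per word, with the word stored in the string variable word
gen temp = 1
reshape long word_, i(temp) j(word) string
drop temp
rename word_ total

* A word was never used if its total is zero
count if total == 0
list word if total == 0
```

If you want the number of unused words per report rather than overall, egen unused = anycount(word_*), values(0) on the report-level data is another option.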