Search code examples
stata

Stata: sum of variable given other variable conditions


I'm trying to generate different 'total count' variables by companyid & year.

One 'total count' for subs, and one total count for loans.

Basically I'm trying to extend this question: Stata: Calculate sum of any x in y?

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 companyid int year float sub_num double sub_amt float(sub_year_total loan_num) double loan_amt float loan_year_total
"001004" 1999  .        0   425000  .       0       0
"001004" 1999  2   425000   425000  .       0       0
"001004" 2004  .        0        0  .       0       0
"001004" 2005  1  4232000  4232000  .       0       0
"001004" 2006  1 16000000 1.60e+07  .       0       0
"001004" 2007  3    58354   182444  .       0       0
"001078" 2006  .        0   471529  .       0       0
"001078" 2006  .        0   471529  .       0       0
"001078" 2006  .        0   471529  .       0       0
"001078" 2006  6    29872   471529  .       0       0
"001078" 2006  6    59748   471529  .       0       0
"001078" 2006  6   381909   471529  .       0       0
"001078" 2007  .        0   768825  7  270000 2580000
"001078" 2007  .        0   768825  7  360000 2580000
"001078" 2007  .        0   768825  7 1500000 2580000
"001078" 2007  .        0   768825  7  450000 2580000
"001078" 2007  .        0   768825  .       0 2580000
"001078" 2007  7   359454   768825  .       0 2580000
"001078" 2007  7   409371   768825  .       0 2580000
"001078" 2008  .        0  1751832  5  450000 2450000
"001078" 2008  .        0  1751832  5 2000000 2450000
"001078" 2008  5    47957  1751832  .       0 2450000
"001078" 2008  5   485631  1751832  .       0 2450000
"001078" 2008  5  1218244  1751832  .       0 2450000
end

To note: If sub_num = 0 then loan_num != 0, and vice versa.

I've tried bysort cik year: gen sub_num = _N if loan_amt != 0 and bysort cik year loan_amt: gen sub_num = _N but neither really does it. I've left my failed count variables in the examples for reference.

i.e. company #001078 in 2007 would have loan_num = 4 and sub_num = 2

I just noticed this example has one observation that has 0 for both, I can just eliminate entries that have 0 for both so no need to comment on that.

How can I make company total annual counts for my 'sub' and 'loan' variables?


Solution

  • This is a little hard to follow.

    • There is reference to cik in your code but it is not in your data example.

    • It is hard to know what is original data and what is the result of calculations you have tried.

    • The example seems more complicated than necessary.

    • Although the title refers to sums, it is also clear that you are interested in counting loans of certain kinds.

    A count is a sum of indicators, so this shows some technique rather than necessarily being an answer. Feed to egen, total() a true-or-false expression and the result will be the count of observations for which the expression is true (1); arguments that are false (0) are ignored in the sense that they make no difference to the sum.

    bysort companyid year : egen wanted1 = total(loan_amt > 0)
    
    bysort companyid year : egen wanted2 = total(loan_amt > 0 & sub_num < .)
    

    _N is just the number of observations, sometimes conditional on other variables. You naturally can assign that number to a variable, but also specifying an if qualifier doesn't make the calculation ignore the excluded values; it just affects which observations are ignored in receiving non-missing values. Consider this experiment:

    . clear
    
    . set obs 1000
    number of observations (_N) was 0, now 1,000
    
    . gen count = _N if _n == 1
    (999 missing values generated)
    
    . l count in 1
    
         +-------+
         | count |
         |-------|
      1. |  1000 |
         +-------+
    

    Otherwise put, _N is not as general a counting method as you need here.