Search code examples
countstatafrequency

Get frequency counts for a subset of elements in a column


I may be missing some elegant ways in Stata to get to this example, which has to do with electrical parts and observed monthly failures etc.

clear
input   str3 (PartID  Type   FailType) 
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0

end 

I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.

I looked at groups etc., but do not know how to peel off certain elements from the result set. So that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean etc. I was able to use contract, bysort etc. and create a separate data set which I then merged back into the main set with an extra column There must be something simple using gen or egen so that there is no need to create an extra data set.

The expected results here will be:
PartID Freq 
ABD 4 #(4 occurs twice)
ABC 2  #(tie broken with minimum)
BBB 0  #(0 occurs 3 times)

Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)

Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, if I issue this followup command after running your code: tabdisp Type, c(Freq). It may print out a nice table. Can I then use that (derived) table to perform more computations programatically?

For example get the first row of the table.

Table. ---------------------- 
Type| Freq ----------+----------- 
A | -1 
B | -1 
C | -1 
D | -3 
S | -3 
---------------------- –

Solution

  • I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.

    clear
    input   str3 (PartID  Type   FailType) 
        ABD A 4
        BBB S 0
        ABD A 3
        ABD A 4
        ABC A 2
        BBB A 0
        ABD B 1
        ABC B 7
        BBB C 1
        BBB D 0
    end 
    
    bysort PartID FailType: gen Freq = -_N 
    bysort PartID (Freq Type) : gen ToShow = _n == 1
    replace Freq = -Freq 
    list PartID Type FailType Freq  if ToShow  
    
         +---------------------------------+
         | PartID   Type   FailType   Freq |
         |---------------------------------|
      1. |    ABC      A          2      1 |
      3. |    ABD      A          4      2 |
      7. |    BBB      A          0      3 |
         +---------------------------------+