Search code examples
stata

Stata: Reducing observations based on yearly data


I want to create a variable that is one or zero if a company (companyid below) is "multicolor" in each year. Below is my data:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 companyid int year float(red blue green)
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2017 1 0 0
"001045" 2017 1 0 0

"001049" 2019 0 1 0
"001049" 2019 0 0 1

"001055" 2018 1 0 0
"001055" 2018 0 1 0
"001055" 2018 0 0 1

So for example, company #001055 is red, blue, and green for 2018 so this 'multicolor' variable should equal to one.

Additionally, I also want to create variables for the different combinations. I.e. a red-blue var = 1 if a company is red and blue = 1 in each year.

I was trying to do something with bysort companyid year: gen multicolor = 1 if red == 1 & blue == 1 & green == 1 but I realize that has a lot missing in what I want to accomplish.

The overall goal is to reduce multiple year observations so I have one observation per year per company.

This single year/company record would have the info if that company was red, green, blue, or the exact mix of these colors if it is mixed. Below would be the example of data that I want to create from the data above.

input str6 companyid int year float(red blue green r-b-g red-blue blue-green ...more...)
"001045" 2015 0 1 0 0 0 0 ...
"001045" 2017 1 0 0 0 0 0 ...
"001049" 2019 0 0 0 0 0 1 ...
"001055" 2018 0 0 0 1 0 0 ...

Solution

  • I think this is a lot easier than you are fearing. First, collapse to maximum values by company and year. Then you have the individual values of red blue green. Second, concatenate the values, so that "110" is red and blue but not green, and so on.

    tabulate would generate all the indicators corresponding to combinations found in the data.

    In effect, the 3 colors and 2 possibilities permit binary encoding, and the string is a binary number too.

    The correspondence for true 1 and false 0 that maximum over 0s and 1s means "any" and that minimum over 0s and 1s means "all" is obvious once understood, but worth explaining otherwise. For a Stata context, see this FAQ

    clear
    input str6 companyid int year float(red blue green)
    "001045" 2015 0 1 0
    "001045" 2015 0 1 0
    "001045" 2015 0 1 0
    "001045" 2015 0 1 0
    "001045" 2017 1 0 0
    "001045" 2017 1 0 0
    "001049" 2019 0 1 0
    "001049" 2019 0 0 1
    "001055" 2018 1 0 0
    "001055" 2018 0 1 0
    "001055" 2018 0 0 1
    end 
    
    collapse (max) red blue green, by(companyid year) 
    
    egen colors = concat(red blue green)
    
    list
    
         +-----------------------------------------------+
         | compan~d   year   red   blue   green   colors |
         |-----------------------------------------------|
      1. |   001045   2015     0      1       0      010 |
      2. |   001045   2017     1      0       0      100 |
      3. |   001049   2019     0      1       1      011 |
      4. |   001055   2018     1      1       1      111 |
         +-----------------------------------------------+