Stata: Removing Non-unique duplicates


I want to keep one copy of each company-year observation, along with its subyear_total value.

Some company-years have multiple entries in my data, as recorded in the variable copies.

Copies was created by:

bysort cik year: gen copies = _N

How can I remove the duplicates but keep one copy of the unique observation?

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year long cik float(subyear_total copies)
1999 1750   425000  1
2005 1750  4232000  1
2006 1750 1.60e+07  1
2007 1750   182444  3
2007 1750   182444  3
2007 1750   182444  3
2008 1750   710909  3
2008 1750   710909  3
2008 1750   710909  3
2009 1750  5155390  5
2009 1750  5155390  5
2009 1750  5155390  5
2009 1750  5155390  5
2009 1750  5155390  5
end

So for example:

2007 has 3 entries and I want to keep one of those and drop the rest. Same thing for 2008 and 2009 (which has 5 entries).

If I do drop if copies > 1, would I lose all instances of those years? How can I keep at least one?


Solution

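  • The duplicates command could be used here, but in your case

    bysort year cik : keep if _n == 1 
    

    gets you there directly: it keeps the first observation in each company-year group and drops the rest. By contrast, drop if copies > 1 would remove every observation for those years, as you suspected. Once the duplicates are gone, the variable copies is of no obvious use.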
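
For comparison, here is a rough sketch of two other ways to get the same result, written against the example data in the question. It assumes that subyear_total is meant to be constant within each cik-year group; the assert line checks that assumption and stops with an error if it fails. The variable name first is made up for illustration. The two options are alternatives, so run one or the other.

```stata
* Check the assumption: subyear_total does not vary within a cik-year group
bysort cik year (subyear_total): assert subyear_total[1] == subyear_total[_N]

* Option 1: report, then drop, surplus rows that match on the listed variables
* (-force- is required whenever a varlist is given to -duplicates drop-)
duplicates report cik year subyear_total
duplicates drop cik year subyear_total, force

* Option 2: tag exactly one row per cik-year group and keep only tagged rows
* (the name "first" is hypothetical)
egen byte first = tag(cik year)
keep if first
drop first
```

If subyear_total did differ within a group, the two options would behave differently: Option 1 would keep one row per distinct value, while Option 2 (like the bysort command above) would keep a single row per company-year regardless of which value it carries.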