Search code examples
sortingstatatabulate

Sorting a collaborative database


I'm working through a database with several repeated IDs entries and unique information per row as such

ID country
A1 AT
A1 BE
A2 CZ
A3 US
A3 UK
A4 NZ

The same ID and different country implies there was collaboration between two different entities in this project.

How can I treat the dataset to tabulate collaborations? I would like to obtain

   AT BE CZ NZ US UK
AT    1
BE 1
CZ       
NZ             1  1
US           1    1  
UK           1 1


Preferably I could have the option to remove duplicates to get

   AT BE CZ NZ US UK
AT    1
BE 
CZ       
NZ             1  1
US           
UK           

and somehow get a variable to produce 'summarize'.

Thank you in advance (I can't get access to Stata Forum due to a firewall)

I tried creating variables that identify multi-country IDs and another that reads the country when it is the second or third linked to the same ID

ID country var 1 var 2
A1 AT 1
A1 BE 2
A2 CZ
A3 US 1
A3 UK 2
A4 NZ 3

Solution

  • Here is a solution:

    clear
    input str2(id country)
    "A1" "AT"
    "A1" "BE"
    "A2" "CZ"
    "A3" "US"
    "A3" "UK"
    "A4" "NZ"
    "A4" "US"
    end
    
    * Get locals with all country abbreviation and the number of countries
    levelsof country, local(countries)
    local num_countries : list sizeof countries
    
    * Collapse each project to one row where the last row has all abbreviations 
    * of countries in that project. Keep only the last row for each project
    sort id country
    replace country = country[_n-1] + " " + country if id[_n-1] == id
    by id : keep if _n == _N
    
    * Get a local with all combination of collaborations
    levelsof country, local(combos)
    
    * Loop over each country
    foreach country of local countries {
        
        * For each country get a list of collaborations that country is part of
        foreach combo of local combos {
            if strpos("`combo'","`country'") {
                local `country'_colab "``country'_colab' `combo'"
            }        
        }
        
        *Remove duplicates from list
        local `country'_colab : list uniq `country'_colab
        *Remove the country itself from that country's list
        local `country'_colab = subinstr("``country'_colab'","`country'","",.)
        *Remove all excessive spaces for no other reason than neatness
        local `country'_colab = trim(itrim("``country'_colab'"))
        
        *Display all collaborations
        di "Country `country' collaborats with countries: ``country'_colab'"
    }
    
    * Clear the original data and create a new data set with one obs per country
    clear
    set obs `num_countries'
    
    * Create one row and row column for each country. 
    * Set the default to 0 for no collaboration
    gen country = ""
    local row = 1
    foreach country of local countries {
        gen `country' = 0
        replace country = "`country'" if _n == `row++'
    }
    
    * Go over the list of collaborations for each country and set value to 1
    * where the row and column country collaborates
    foreach country of local countries {
        foreach colab of local `country'_colab {
            replace `colab' = 1 if country == "`country'"
        }
    }
    

    I have not cleared duplicates as you suggested as a preference as I think you get an ambivalent data set as it is not clear for which country you should drop a row and for which you should drop a column.