I'm working through a database with several repeated IDs entries and unique information per row as such
ID | country |
---|---|
A1 | AT |
A1 | BE |
A2 | CZ |
A3 | US |
A3 | UK |
A4 | NZ |
The same ID and different country implies there was collaboration between two different entities in this project.
How can I treat the dataset to tabulate collaborations? I would like to obtain
AT BE CZ NZ US UK
AT 1
BE 1
CZ
NZ 1 1
US 1 1
UK 1 1
Preferably I could have the option to remove duplicates to get
AT BE CZ NZ US UK
AT 1
BE
CZ
NZ 1 1
US
UK
and somehow get a variable to produce 'summarize'.
Thank you in advance (I can't get access to Stata Forum due to a firewall)
I tried creating variables that identify multi-country IDs and another that reads the country when it is the second or third linked to the same ID
ID | country | var 1 | var 2 |
---|---|---|---|
A1 | AT | 1 | |
A1 | BE | 2 | |
A2 | CZ | ||
A3 | US | 1 | |
A3 | UK | 2 | |
A4 | NZ | 3 |
Here is a solution:
clear
input str2(id country)
"A1" "AT"
"A1" "BE"
"A2" "CZ"
"A3" "US"
"A3" "UK"
"A4" "NZ"
"A4" "US"
end
* Get locals with all country abbreviation and the number of countries
levelsof country, local(countries)
local num_countries : list sizeof countries
* Collapse each project to one row where the last row has all abbreviations
* of countries in that project. Keep only the last row for each project
sort id country
replace country = country[_n-1] + " " + country if id[_n-1] == id
by id : keep if _n == _N
* Get a local with all combination of collaborations
levelsof country, local(combos)
* Loop over each country
foreach country of local countries {
* For each country get a list of collaborations that country is part of
foreach combo of local combos {
if strpos("`combo'","`country'") {
local `country'_colab "``country'_colab' `combo'"
}
}
*Remove duplicates from list
local `country'_colab : list uniq `country'_colab
*Remove the country itself from that country's list
local `country'_colab = subinstr("``country'_colab'","`country'","",.)
*Remove all excessive spaces for no other reason than neatness
local `country'_colab = trim(itrim("``country'_colab'"))
*Display all collaborations
di "Country `country' collaborats with countries: ``country'_colab'"
}
* Clear the original data and create a new data set with one obs per country
clear
set obs `num_countries'
* Create one row and row column for each country.
* Set the default to 0 for no collaboration
gen country = ""
local row = 1
foreach country of local countries {
gen `country' = 0
replace country = "`country'" if _n == `row++'
}
* Go over the list of collaborations for each country and set value to 1
* where the row and column country collaborates
foreach country of local countries {
foreach colab of local `country'_colab {
replace `colab' = 1 if country == "`country'"
}
}
I have not cleared duplicates as you suggested as a preference as I think you get an ambivalent data set as it is not clear for which country you should drop a row and for which you should drop a column.