Search code examples
rdataframedatatablecountdata-cleaning

How to find the number of variables in common between variables two by two in a column R compared to variables in another column?


I'm working on R, I have a character data table like that :

Species<- c("lion", "tiger", "lion","tiger","donkey","lion","donkey")
Countries <- c("Tanzania", "Tanzania", "Kenya","Kenya","Italia","Niger","France")

df <- data.frame(Species, Countries)
Species Countries
lion Tanzania
tiger Tanzania
lion Kenya
tiger Kenya
donkey Italia
lion Niger
donkey France

With this table, I would like to have for each country of my dataframe the number of species in common for the countries 2 by 2, and have a final table like this one:

Countries1 Countries2 number_common_species
Kenya Tanzania 2
Italia Tanzania 0
Italia Kenya 0
Niger Kenya 1
Niger Tanzania 1
Italia Niger 0
Italia France 1
Tanzania France 0
Kenya France 0
Niger France 0

I have a really big dataset with a lot of species.

Does anyone know how I can do this?


Solution

  • First make a data frame of each pairwise combination of countries, then find the length of the set of species that each pair has in common:

    counts <- data.frame(t(combn(unique(dat$Countries),2)))
    counts$count <- apply(counts, 1, function(x) length(intersect( dat$Species[dat$Countries==x[1]], 
                                                                   dat$Species[dat$Countries==x[2]])))
    counts
    
             X1     X2 count
    1  Tanzania  Kenya     2
    2  Tanzania Italia     0
    3  Tanzania  Niger     1
    4  Tanzania France     0
    5     Kenya Italia     0
    6     Kenya  Niger     1
    7     Kenya France     0
    8    Italia  Niger     0
    9    Italia France     1
    10    Niger France     0