Search code examples
rdataframemappingadjacency-matrix

Mapping dataframe column values to a n by n matrix


I'm trying to map column values of a data.frame object (consisting of large number of bilateral trade data among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flows between any two countries.

The data looks like this

# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")

head(example_data, n = 10)
   rid pid TradeValue
1    2   3        500
2    2   7       2328
3    2   8    2233465
4    2   9      81470
5    2  12     572893
6    2  17     488374
7    2  19    3314932
8    2  23      20323
9    2  25         10
10   2  29    9026220

length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161

where rid is reporter id, pid is (trade) partner id, a country's rid and pid are the same. The same id(s) in the rid column are matched with multiple rows in the pid column in terms of TradeValue.

However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their id(s) are absent in the rid column (such as country 1). On the other hand, those country id(s) may enter into pid column through other countries' reporting (in which case, the reporters tend to be developed countries). Hence, the rid column only contains some of the country id (only 139 out of 161), while the pid column has all 161 country id.

What I'm attempting to do is to map this example_data dataframe to a 161 x 161 adjacency matrix using rid for row and pid for column where each cell represent the TradeValue between any two country id. To this end, there are a couple things I need to tackle with:

  1. Fill in those country id(s) that are missing in the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
  2. By previous step, impute those "0" cells using bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.

For example, for a 5-country dataframe of the following form

rid pid TradeValue
2   1   50
2   3   45
2   4   7
2   5   18
3   1   24
3   2   45
3   4   88
3   5   12
5   1   27
5   2   18
5   3   12
5   4   92

The desired output should look like this

     pid_1  pid_2   pid_3   pid_4   pid_5
rid_1    0    50      24        0      27
rid_2   50     0      45        7      18
rid_3   24    45       0       88      12
rid_4    0     7      88        0      92
rid_5   27    18      12       92       0

but on top of my mind, I could not figure out how to. It will be really appreciated if someone can help me on this.


Solution

  • df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
    df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
    
    data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
    
    #    rid pid_1 pid_2 pid_3 pid_4 pid_5
    #1 rid_1     0     0     0     0     0
    #2 rid_2    50     0    45     7    18
    #3 rid_3    24    45     0    88    12
    #4 rid_4     0     0     0     0     0
    #5 rid_5    27    18    12    92     0
    

    The secrets/ tricks:

    • use factor variables to tell R what values are all possible as well as the order.

    • in data.tables dcast use fill = 0 (fill zero where you have nothing), drop = FALSE (make entries for factor levels that aren't observed)