I'm trying to prep data for use in various network visualisation applications in R and in Gephi. These formats want numeric identifiers that link between two databases. I have figured out the latter part, but I can't find a succinct way to create a numeric ID variable across columns in a data frame. Here's some reproducible code that illustrates what I'm trying to do.
org.data <- data.frame(source=c('bob','sue','ann','john','sinbad'),
target=c('sinbad','turtledove','Aerosmith','bob','john'))
desired.data <- data.frame(source=c('1','2','3','4','5'),
target=c('5','6','7','1','4'))
org.data
source target
1 bob sinbad
2 sue turtledove
3 ann Aerosmith
4 john bob
5 sinbad john
desired.data
source target
1 1 5
2 2 6
3 3 7
4 4 1
5 5 4
Here's a base R method using match on the unlisted unique names in the original data.frame.
To replace the current data.frame, use
org.data[] <- sapply(org.data, match, table=unique(unlist(org.data)))
Here, sapply loops through the variables in org.data and applies match to each. match returns, for each element of its first argument, the position of its first match in the table argument. Here, table is the unlisted unique elements in org.data: unique(unlist(org.data)). In this case, sapply returns a matrix. Appending [] to org.data in org.data[] <- assigns that matrix back into the existing data.frame, replacing the original values. This construction can be thought of as preserving the structure of the original object during the assignment.
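To see the two pieces separately, here is a minimal sketch on toy inputs. The names lookup, x, ids and df are made up for illustration and are not part of the original data.

```r
# Hypothetical toy inputs, for illustration only
lookup <- c("bob", "sue", "ann")
x <- c("ann", "bob", "zed")

# match() gives the position of each element of x in lookup (NA if absent)
ids <- match(x, table = lookup)
ids

# Assigning the matrix from sapply() into df[] keeps df a data.frame
# with its original column names
df <- data.frame(a = c("bob", "sue"), b = c("ann", "bob"),
                 stringsAsFactors = FALSE)
df[] <- sapply(df, match, table = lookup)
class(df)
names(df)
```

Note that a name missing from the table comes back as NA rather than raising an error, so it may be worth checking for NA values if the table is built from a different source than the data.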
To construct a new data.frame, use
setNames(data.frame(sapply(org.data, match, table=unique(unlist(org.data)))),
names(org.data))
Or better, as Henrik suggests, it is probably easier to first create a copy of the data.frame and then use the first line of code to fill in the copy, rather than using setNames and data.frame.
desired.data <- org.data
desired.data[] <- sapply(org.data, match, table=unique(unlist(org.data)))
Both of these return
source target
1 1 5
2 2 6
3 3 7
4 4 1
5 5 4
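Since the numeric IDs are meant to link back to the names in a second table, it may help to keep the lookup vector around as a node table. This is a minimal sketch, assuming the org.data from the question. The names node.names, nodes and edges are made up here, and the id and label column names are an assumption about what the downstream tool expects.

```r
org.data <- data.frame(source = c('bob','sue','ann','john','sinbad'),
                       target = c('sinbad','turtledove','Aerosmith','bob','john'),
                       stringsAsFactors = FALSE)

# Unique names in order of first appearance (all of source, then target)
node.names <- unique(unlist(org.data))

# Node table: one row per name, with its numeric ID (assumed column names)
nodes <- data.frame(id = seq_along(node.names), label = node.names,
                    stringsAsFactors = FALSE)

# Edge table: the same match() recoding as above, applied to a copy
edges <- org.data
edges[] <- lapply(org.data, match, table = node.names)

nodes
edges
```

Using lapply here instead of sapply is a small variation: it returns a list of columns, so the intermediate matrix step is skipped, but the result should be the same as the sapply version.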