How to create a unique identifier ID across columns?


I'm trying to prep data for use in various network visualisation applications in R and in Gephi. These formats want numeric identifiers that link between two databases. I have figured out the latter part, but I can't find a succinct way to create a numeric ID variable across columns in a data frame. Here's some reproducible code that illustrates what I'm trying to do.

org.data <- data.frame(source=c('bob','sue','ann','john','sinbad'),
       target=c('sinbad','turtledove','Aerosmith','bob','john'))

desired.data <- data.frame(source=c('1','2','3','4','5'),
                       target=c('5','6','7','1','4'))


org.data

  source     target
1    bob     sinbad
2    sue turtledove
3    ann  Aerosmith
4   john        bob
5 sinbad       john

desired.data

  source target
1    1      5
2    2      6
3    3      7
4    4      1
5    5      4

Solution

  • Here's a base R method that uses match against the unique values of the unlisted data.frame.

    To replace the current data.frame, use

    org.data[] <- sapply(org.data, match, table=unique(unlist(org.data)))
    

    Here, sapply loops over the columns of org.data and applies match to each. match returns the position of its first argument in the table argument, which here is the vector of unique values across both columns: unique(unlist(org.data)). Because unique keeps the first occurrence of each value, IDs are assigned in order of first appearance, going down source and then target. In this case, sapply returns a matrix. Assigning it to org.data[] rather than to org.data replaces the contents column by column while preserving the structure of the original object, so the result is still a data.frame with the same column names.
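
    To see what each piece contributes, the one-liner can be broken into steps. This is only an illustrative sketch: the names lookup and ids are made up here, and stringsAsFactors = FALSE is set explicitly so the columns are plain character vectors regardless of R version.

    ```r
    # Same example data as in the question, kept as character columns
    org.data <- data.frame(source = c('bob', 'sue', 'ann', 'john', 'sinbad'),
                           target = c('sinbad', 'turtledove', 'Aerosmith', 'bob', 'john'),
                           stringsAsFactors = FALSE)

    # Step 1: stack both columns into one vector and keep the first
    # occurrence of each name
    lookup <- unique(unlist(org.data))
    # lookup holds: bob, sue, ann, john, sinbad, turtledove, Aerosmith

    # Step 2: the position of a name in lookup serves as its numeric ID
    match(org.data$target, lookup)
    # 5 6 7 1 4

    # Step 3: sapply repeats step 2 for every column and returns a matrix
    ids <- sapply(org.data, match, table = lookup)
    ```

    Any name that is missing from lookup would come back as NA, which cannot happen here because lookup is built from the same columns.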

    To construct a new data.frame, use

    setNames(data.frame(sapply(org.data, match, table=unique(unlist(org.data)))),
             names(org.data))
    

    Or better, as Henrik suggests, it would probably be easier to first create a copy of the data.frame and then use the first line of code to fill in the copy rather than using setNames and data.frame.

    desired.data <- org.data
    desired.data[] <- sapply(org.data, match, table=unique(unlist(org.data)))
    

    Both of these return

      source target
    1      1      5
    2      2      6
    3      3      7
    4      4      1
    5      5      4