Search code examples
ranalysisspssnetwork-traffic

How do I pair up rows of network traffic data in R or SPSS?


I have lots of SiLK flow data that I would like to do some data mining on. It looks like the destination IP column matches the source IP column of a row of data further down. The rows (with many more columns) look like this:

UID SIP DIP PROTOCOL    SPORT   DPORT
720107626538    1207697420  3232248333  17  53  7722
720108826800    3232248333  1207697420  17  47904   53

I have never programmed in R or SPSS and am having trouble figuring out how to turn 2 rows of 27 columns of data into 1 row of 54 columns of data.


Solution

  • You can get corresponding SIP and DIP records on the same line through merge:

    df <- data.frame(
      "UID" = c(720107626538, 720108826800),
      "SIP" = c(1207697420, 3232248333),
      "DIP" = c(3232248333, 1207697420),
      "PROTOCOL" = c(17, 17),
      "SPORT" = c(53, 47904),
      "DPORT" = c(7722, 53),
      stringsAsFactors = FALSE)
    
    df_merged <- merge(
      df[,setdiff(colnames(df), "DIP")],
      df[,setdiff(colnames(df), "SIP")],
      by.x = "SIP",
      by.y = "DIP",
      all = FALSE,
      suffixes = c("_SIP", "_DIP"))
    

    After that, you can use the UID fields to remove duplicates:

    for(i in 2:nrow(df_merged)) {
      ind <- df_merged$UID_DIP
      ind[i] <- df_merged$UID_SIP[i]
      df_merged <- df_merged[!duplicated(ind),]
    }
    
    df_merged
    
    df_merged
             SIP      UID_SIP PROTOCOL_SIP SPORT_SIP DPORT_SIP      UID_DIP PROTOCOL_DIP SPORT_DIP DPORT_DIP
    1 1207697420 720107626538           17        53      7722 720108826800           17     47904        53
    

    Because the de-duping relies on a loop, the whole thing could get very time-consuming if your dataset is large.