I have lots of SiLK flow data that I would like to do some data mining on. It looks like the destination IP column matches the source IP column of a row of data further down. The rows (with many more columns) look like this:
UID SIP DIP PROTOCOL SPORT DPORT 720107626538 1207697420 3232248333 17 53 7722 720108826800 3232248333 1207697420 17 47904 53
I have never programmed in R or SPSS and am having trouble figuring out how to turn 2 rows of 27 columns of data into 1 row of 54 columns of data.
You can get corresponding SIP and DIP records on the same line through merge
:
df <- data.frame(
"UID" = c(720107626538, 720108826800),
"SIP" = c(1207697420, 3232248333),
"DIP" = c(3232248333, 1207697420),
"PROTOCOL" = c(17, 17),
"SPORT" = c(53, 47904),
"DPORT" = c(7722, 53),
stringsAsFactors = FALSE)
df_merged <- merge(
df[,setdiff(colnames(df), "DIP")],
df[,setdiff(colnames(df), "SIP")],
by.x = "SIP",
by.y = "DIP",
all = FALSE,
suffixes = c("_SIP", "_DIP"))
After that, you can use the UID fields to remove duplicates:
for(i in 2:nrow(df_merged)) {
ind <- df_merged$UID_DIP
ind[i] <- df_merged$UID_SIP[i]
df_merged <- df_merged[!duplicated(ind),]
}
df_merged
df_merged
SIP UID_SIP PROTOCOL_SIP SPORT_SIP DPORT_SIP UID_DIP PROTOCOL_DIP SPORT_DIP DPORT_DIP
1 1207697420 720107626538 17 53 7722 720108826800 17 47904 53
Because the de-duping relies on a loop, the whole thing could get very time-consuming if your dataset is large.