I have two of large files, the contents of the files looks like:
df1
df2
dput of
df1
structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L), .Label = c("00:00,location,long|", "00:00,location,long|00:00,location,same|",
"00:00,location,long|00:00,runapps,com.sol.sviewcall|00:00,screen,OFF|",
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,runapps,com.kakao.talk|00:00,screen,OFF|",
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,wifi,dlink, iptime|00:00,wifi,dlink|",
"00:00,location,long|00:00,wifi,dlink|", "00:00,location,long|00:00,wifi,dlink|00:00,location,same|00:00,wifi,dlink, iptime|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA,
-183L))
df2
structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), .Label = c("00:00,location,long|",
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, iptime|",
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, SK_WiFi26C4, U+zone, U+Net642B|",
"00:00,location,long|00:00,wifi,dlink, SK_WiFi26C4|", "01:00,location,long|",
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,screen,OFF|01:00,runapps,com.kakao.talk|",
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,wifi,dlink, iptime, SK_WiFi26C4|01:00,wifi,dlink, iptime, PISnet_4D9740|01:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|01:00,runapps,com.buzzpia.aqua.launcher|01:00,screen,OFF|",
"01:00,location,long|01:00,screen,OFF|", "02:00,location,long|02:00,wifi,dlink, iptime, SK_WiFi26C4|02:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|02:00,wifi,dlink, iptime, KT_WLAN_BBE3|02:00,runapps,com.kakao.talk|02:00,screen,OFF|",
"02:00,location,long|02:00,wifi,dlink, iptime|02:00,runapps,com.buzzpia.aqua.launcher|02:00,runapps,com.android.mms|02:00,screen,OFF|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA,
-232L))
My questions are:
I want to know the percentage of matching data of all rows, for example how many rows which has same data between df1 and df2.
I want to know the percentage of similarity data of all rows, one of the data looks like "00:,location,long" I use delimiter "|" to separate one data to others. In this case, if one row in df1 and df2 >= 75 % similar, I consider that rows are similar. for example the rows contains three data, and two data is same, one data is different, that is similar
So, I want to calculate, the percentage of matching rows (how many rows in df1 match with rows in df2), and the percentage of similarity rows (how many rows in df1 similar with rows in df2), and the percentage of different rows (how many rows in df1 different with rows in df2)
The base data is df1, I mean I want to know how many rows which match, similar, or different of the df2 to df1
I use R language, I've tried but I stuck. Hope someone can give a light to me
I Guess your question is to find the all of rows in df2 that not in df1 or all of rows in df2 that in df1.
If that you mean, you can use sqldf
library
library(sqldf)
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
df2Indf1 <- sqldf('SELECT * FROM df2 INTERSECT SELECT * FROM df1')
Another way, you can use dplyr
library(dplyr)
anti_join(df2,df1)
semi_join(df2,df1)
For the similarity, if you mean to measure the score of similarity between two strings data, you can use Levenshtein Distance
see the details in this link . You can apply this to your data frame.