Search code examples
rdataframecomparesimilarity

Find match and similarity in two of files R language


I have two of large files, the contents of the files looks like:

df1

enter image description here

df2

enter image description here

dput of

df1

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L), .Label = c("00:00,location,long|", "00:00,location,long|00:00,location,same|", 
"00:00,location,long|00:00,runapps,com.sol.sviewcall|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,runapps,com.kakao.talk|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,wifi,dlink, iptime|00:00,wifi,dlink|", 
"00:00,location,long|00:00,wifi,dlink|", "00:00,location,long|00:00,wifi,dlink|00:00,location,same|00:00,wifi,dlink, iptime|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-183L))

df2

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), .Label = c("00:00,location,long|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, iptime|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, SK_WiFi26C4, U+zone, U+Net642B|", 
"00:00,location,long|00:00,wifi,dlink, SK_WiFi26C4|", "01:00,location,long|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,screen,OFF|01:00,runapps,com.kakao.talk|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,wifi,dlink, iptime, SK_WiFi26C4|01:00,wifi,dlink, iptime, PISnet_4D9740|01:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|01:00,runapps,com.buzzpia.aqua.launcher|01:00,screen,OFF|", 
"01:00,location,long|01:00,screen,OFF|", "02:00,location,long|02:00,wifi,dlink, iptime, SK_WiFi26C4|02:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|02:00,wifi,dlink, iptime, KT_WLAN_BBE3|02:00,runapps,com.kakao.talk|02:00,screen,OFF|", 
"02:00,location,long|02:00,wifi,dlink, iptime|02:00,runapps,com.buzzpia.aqua.launcher|02:00,runapps,com.android.mms|02:00,screen,OFF|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-232L))

My questions are:

  1. I want to know the percentage of matching data of all rows, for example how many rows which has same data between df1 and df2.

  2. I want to know the percentage of similarity data of all rows, one of the data looks like "00:,location,long" I use delimiter "|" to separate one data to others. In this case, if one row in df1 and df2 >= 75 % similar, I consider that rows are similar. for example the rows contains three data, and two data is same, one data is different, that is similar

  3. I want to know the percentage of different data of all rows in df1 and df2

So, I want to calculate, the percentage of matching rows (how many rows in df1 match with rows in df2), and the percentage of similarity rows (how many rows in df1 similar with rows in df2), and the percentage of different rows (how many rows in df1 different with rows in df2)

The base data is df1, I mean I want to know how many rows which match, similar, or different of the df2 to df1

I use R language, I've tried but I stuck. Hope someone can give a light to me


Solution

  • I Guess your question is to find the all of rows in df2 that not in df1 or all of rows in df2 that in df1. If that you mean, you can use sqldf library

    library(sqldf)
    
    df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
    df2Indf1 <- sqldf('SELECT * FROM df2 INTERSECT SELECT * FROM df1')
    

    Another way, you can use dplyr

    library(dplyr)
    anti_join(df2,df1)
    semi_join(df2,df1)
    

    For the similarity, if you mean to measure the score of similarity between two strings data, you can use Levenshtein Distance see the details in this link . You can apply this to your data frame.