I'm currently trying to measure the Jaccard Distance between tweets in a dataset
This is where the dataset is
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
I've tried a few things to measure the distance
This is what I have so far
I saved the linked dataset to a file called Tweets.json
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file("Tweets.json")),collapse=",")))
Then I converted json_alldata to tweet.features and got rid of the geo column
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
These are what the first two tweets look like
tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
First thing I tried was using the method stringdist
which is under the stringdist library
install.packages("stringdist")
library(stringdist)
#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")
When I run that, I get
[1] 0.1621622
I'm not sure that's correct, though. A intersection B = 23, and A union B = 25. The Jaccard distance is A intersection B/A union B -- right? So by my calculation, the Jaccard distance should be 0.92?
So I figured I could do it by sets. Simply calculate intersection and union and divide
This is what I tried
# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])
When I try to do intersection, I get this: The output is just list()
Intersection <- intersect(A1, A2)
list()
When I try Union, I get this:
union(A1, A2)
[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
This doesn't seem to be grouping the words into a single set.
I figured I'd be able to divide the intersection by the union. But I guess I would need the program to count the number or words in each set, then do the calculations.
Needless to say, I'm a bit stuck and I'm not sure if I'm on the right track.
Any help would be appreciated. Thank you.
intersect
and union
expect vectors (as.set
does not exist). I think you want to compare words so you can use strsplit
but the way the split is done belongs to you. An example below:
tweet.features <- list(tweet1="RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
tweet2= "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")
jaccard_i <- function(tw1, tw2){
tw1 <- unlist(strsplit(tw1, " |\\."))
tw2 <- unlist(strsplit(tw2, " |\\."))
i <- length(intersect(tw1, tw2))
u <- length(union(tw1, tw2))
list(i=i, u=u, j=i/u)
}
jaccard_i(tweet.features[[1]], tweet.features[[2]])
$i
[1] 20
$u
[1] 23
$j
[1] 0.8695652
Is this want you want?
The strsplit
is here done for every space or dot. You may want to refine the split
argument from strsplit
and replace " |\\."
for something more specific (see ?regex
).