I am working on a Twitter dataset and I haven't figured out how to subset my data according to a list of hashtags.
df:
rowID Hashtags
1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 onlarkonusurakpartiyapar
5 anfal,halabja,kurdistan,kobani
6 onlarkonusurakpartiyapar
7 kurdistan
Hashtags are a character list
hashtag_list:
"onlarkonusurakpartiyapar" "kurdistan"
I tried this code, but it didn't work for me:
new_df=df[df$Hashtags %in% hashtag_list,]
It only returns the rows for the "onlarkonusurakpartiyapar" hashtag. I know it looks simple, but I haven't been able to figure it out even though I have looked through all the posts on the site. Thanks for your help.
Here is an approach that modifies yours by treating the strings separated by "," as distinct hashtags, and counting a row as a match if any of those hashtags is in your list.
# Example data: the question's seven rows plus one row that should not match
df <- data.frame(
  rowID = 1:8,
  Hashtags = c(
    "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar",
    "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
    "kurdish,mahabad,justiceforfarinaz,kurdistan",
    "onlarkonusurakpartiyapar",
    "anfal,halabja,kurdistan,kobani",
    "onlarkonusurakpartiyapar",
    "kurdistan",
    "this,willnot,befound"
  ),
  stringsAsFactors = FALSE
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# Split each row on "," and return TRUE if any piece is in hashtag_list
find_ht <- function(hashtags, hashtag_list) {
  sapply(strsplit(hashtags, split = ","), function(x) any(x %in% hashtag_list))
}
find_ht(hashtags = df$Hashtags, hashtag_list = hashtag_list)
which returns ...
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
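For comparison, here is a small sketch of why the original `%in%` call missed rows. It uses a three-row vector named `hashtags`, which is just an illustrative cut-down of the data above. Comparing the whole string only matches a row whose entire value equals one list entry, while splitting first tests each hashtag separately.

```r
# Illustrative subset of the example data (three rows only)
hashtags <- c(
  "kurdish,mahabad,justiceforfarinaz,kurdistan",
  "onlarkonusurakpartiyapar",
  "this,willnot,befound"
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# Whole-string comparison: a row with several hashtags never equals a list entry
whole_match <- hashtags %in% hashtag_list

# Per-hashtag comparison: split on "," and test each piece
split_match <- sapply(strsplit(hashtags, split = ","),
                      function(x) any(x %in% hashtag_list))

print(whole_match)  # FALSE  TRUE FALSE
print(split_match)  #  TRUE  TRUE FALSE
```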
To perform the subset, you simply need to ...
sub.index <- find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
df[sub.index,]
which returns
rowID Hashtags
1 1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 4 onlarkonusurakpartiyapar
5 5 anfal,halabja,kurdistan,kobani
6 6 onlarkonusurakpartiyapar
7 7 kurdistan
Or, if you want the indices, do which(sub.index). To subset only the rowIDs, do df[sub.index,"rowID"]. In this case, both of those return [1] 1 2 3 4 5 6 7
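If your real data is messier than the example, a slightly more defensive variant of find_ht may help. This is only a sketch: the name `find_ht_safe` is made up for illustration, and it assumes hashtags may carry stray spaces after the commas or be empty strings. It uses `vapply` so the result is always a logical vector, and `trimws` to drop surrounding whitespace before matching.

```r
# Hypothetical variant of find_ht(); the name find_ht_safe is not from the answer above
find_ht_safe <- function(hashtags, hashtag_list) {
  # fixed = TRUE treats "," as a literal string rather than a regular expression
  pieces <- strsplit(hashtags, split = ",", fixed = TRUE)
  # vapply guarantees exactly one logical value per row
  vapply(pieces, function(x) any(trimws(x) %in% hashtag_list), logical(1))
}

# Made-up inputs: spaces after commas, a non-matching row, and an empty string
hashtags <- c("anfal, halabja, kurdistan", "this,willnot,befound", "")
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

res <- find_ht_safe(hashtags, hashtag_list)
print(res)  # TRUE FALSE FALSE
```

You would use it the same way as before, e.g. df[find_ht_safe(df$Hashtags, hashtag_list),].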