R: subset data according to a list


I am working on a Twitter dataset and I haven't figured out how to subset my data according to a list of hashtags.

df:

rowID                Hashtags
 1                   ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
 2                   onlarkonusurakpartiyapar,halkinbasbakanitokatta
 3                   kurdish,mahabad,justiceforfarinaz,kurdistan
 4                   onlarkonusurakpartiyapar
 5                   anfal,halabja,kurdistan,kobani
 6                   onlarkonusurakpartiyapar
 7                   kurdistan

The Hashtags column is a character vector; each row holds one or more comma-separated hashtags.

hashtag_list:

"onlarkonusurakpartiyapar" "kurdistan"

I tried this code, but it didn't work for me:

new_df=df[df$Hashtags %in% hashtag_list,]

It only gives me the subset for the "onlarkonusurakpartiyapar" hashtag. I know it looks simple, but I couldn't figure it out even though I have looked through all the posts on the site. Thanks for your help.


Solution

  • Here is an approach that modifies yours by treating the strings separated by "," as distinct hashtags, and counting a row as a match if any of its hashtags are in your list.

    Your Data

    df <- data.frame(
        rowID=1:8, 
        Hashtags=c(
            "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar", 
            "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
            "kurdish,mahabad,justiceforfarinaz,kurdistan",
            "onlarkonusurakpartiyapar",
            "anfal,halabja,kurdistan,kobani",
            "onlarkonusurakpartiyapar",
            "kurdistan",
            "this,willnot,befound"
        ), 
        stringsAsFactors=FALSE
    )
    hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")
    

    The Solution

    find_ht <- function(hashtags, hashtag_list){
        # Split each row on "," and flag the row if any of its hashtags is in hashtag_list
        sapply(strsplit(hashtags, split=","), function(x) any(x %in% hashtag_list))
    }
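
    To see why the split matters, here is a small sketch using two rows from the data above. The names whole_match, pieces, and split_match are made up for illustration, and vapply() stands in for sapply() only to make the return type explicit; it is not part of the original answer.

    ```r
    # Two example rows copied from the data above
    hashtags <- c("kurdish,mahabad,justiceforfarinaz,kurdistan", "kurdistan")
    hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

    # %in% compares whole strings, so the multi-hashtag row is not matched
    whole_match <- hashtags %in% hashtag_list

    # strsplit() returns a list with one character vector of hashtags per row
    pieces <- strsplit(hashtags, split = ",")

    # any() reduces the per-hashtag comparison to a single logical per row
    split_match <- vapply(pieces, function(x) any(x %in% hashtag_list), logical(1))

    print(whole_match)  # FALSE TRUE
    print(split_match)  # TRUE TRUE
    ```

    The first row is missed by the plain %in% test because the whole string "kurdish,mahabad,justiceforfarinaz,kurdistan" is not itself an element of hashtag_list.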
    

    Implementation

    find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
    

    which returns ...

    [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
    

    Edit

    To perform the subset, you simply need to ...

    sub.index <- find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
    df[sub.index,]
    

    which returns

      rowID                                                     Hashtags
    1     1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
    2     2              onlarkonusurakpartiyapar,halkinbasbakanitokatta
    3     3                  kurdish,mahabad,justiceforfarinaz,kurdistan
    4     4                                     onlarkonusurakpartiyapar
    5     5                               anfal,halabja,kurdistan,kobani
    6     6                                     onlarkonusurakpartiyapar
    7     7                                                    kurdistan
    

    Or, if you want the indices, use which(sub.index). To subset only the rowIDs, use df[sub.index,"rowID"]. In this case, both return [1] 1 2 3 4 5 6 7
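
    The index-based variants can be sketched end to end as follows. The data and the find_ht helper are repeated from the answer so the snippet runs on its own; the names row_positions, matched_ids, and unmatched are hypothetical, and the negated subset is an extra illustration that the original answer does not include.

    ```r
    # Data and helper repeated from the answer so this runs standalone
    df <- data.frame(
        rowID = 1:8,
        Hashtags = c(
            "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar",
            "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
            "kurdish,mahabad,justiceforfarinaz,kurdistan",
            "onlarkonusurakpartiyapar",
            "anfal,halabja,kurdistan,kobani",
            "onlarkonusurakpartiyapar",
            "kurdistan",
            "this,willnot,befound"
        ),
        stringsAsFactors = FALSE
    )
    hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

    find_ht <- function(hashtags, hashtag_list) {
        sapply(strsplit(hashtags, split = ","), function(x) any(x %in% hashtag_list))
    }

    # Logical vector: TRUE for rows with at least one listed hashtag
    sub.index <- find_ht(hashtags = df$Hashtags, hashtag_list = hashtag_list)

    row_positions <- which(sub.index)        # positions of the matching rows
    matched_ids   <- df[sub.index, "rowID"]  # rowID values of the matching rows
    unmatched     <- df[!sub.index, ]        # rows with none of the listed hashtags
    ```

    Here row_positions and matched_ids coincide only because rowID happens to equal the row position; unmatched holds the single row "this,willnot,befound".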