Tags: r, twitter, text-mining

Using grep function for text mining


I am having a problem scoring my data. Below is the data set; text holds the tweets on which I want to do text mining and sentiment analysis.

**text**                                             **call**   **bills**   **location**
-the bill was not generated                          0          bill        0
-tried to raise the complaint                        0          0           0
-the location update failed                          0          0           location
-the call drop has increased in my location          call       0           location
-nobody in the location received bill,so call ASAP   call       bill        location

This is dummy data, where text is the column I am trying to mine. I have used the grepl function in R to create the category columns (e.g. bills, call, location): if a keyword such as bill appears in a row, the keyword is written under the matching column, otherwise 0, and likewise for all the other categories.

vdftweet$app = ifelse(grepl('app',tolower(vdftweet$text)),'app',0)
table(vdftweet$app)

Now, the problem I have not been able to solve is this:

I want to create a new column "category_name" in which each row gives the names of the categories the tweet falls into. If a tweet falls into more than 3 categories, mark it as 'other'; otherwise give the names of its categories.


Solution

  • There are a couple of ways to do this with the tidyverse packages. In the first method, mutate adds the category names as columns to the text data.frame, similar to what you have. gather then transforms that to key-value format, in which the categories are values in the category_name column.

    The alternative approach goes directly to the key-value format in which categories are values in the category_name column. Rows are repeated if they fall into multiple categories. If you don't need the first form with the categories as column names, the alternative approach is more flexible for adding new categories and requires less processing.

    In both methods, str_match is given the regular expression that matches the category in the text. The pattern here is trivial, but a more complex pattern could be used if needed.
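
    As a rough illustration of what a more complex pattern might look like, the sketch below matches "bill" or "bills" only as whole words and ignores letter case. The sample tweets and the name bill_pattern are made up for this example and are not part of the original data.

    ```r
    library(stringr)

    # Hypothetical sample tweets, used only for this illustration.
    tweets <- c("The Bill was not generated",
                "Billing is fine, but calls keep dropping",
                "tried to raise the complaint")

    # Assumed pattern: the whole word "bill" or "bills", in any letter case.
    bill_pattern <- regex("\\bbills?\\b", ignore_case = TRUE)

    # str_match() returns a matrix; column 1 holds the full match (NA if none).
    bill_hit <- str_match(tweets, bill_pattern)[, 1]
    bill_hit
    ```

    Only the first tweet matches here, because "Billing" is not the whole word "bill". If capture groups are not needed, str_extract returns the same full match directly as a character vector.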

    The code for both methods follows:

    library(tidyverse)
    #
    # read dummy data into data frame
    #
       dummy_dat <- read.table(header = TRUE, stringsAsFactors = FALSE,
                               strip.white = TRUE, sep = "\n",
                               text = "text
                   -the bill was not generated
                   -tried to raise the complaint
                   -the location update failed
                   -the call drop has increased in my location
                   -nobody in the location received bill,so call ASAP")
    #
    #  form data frame with categories as columns
    #
       dummy_cats <- dummy_dat %>%
                   mutate(text = tolower(text),
                          # match against the lower-cased text column; [, 1] keeps
                          # the full match as a plain character vector
                          bill = str_match(text, pattern = "bill")[, 1],
                          call = str_match(text, pattern = "call")[, 1],
                          location = str_match(text, pattern = "location")[, 1],
                          other = ifelse(is.na(bill) & is.na(call) &
                                         is.na(location), "other", NA))
    #
    #  convert categories as columns to key-value format
    #  with categories as values in category_name column
    #
    
       dummy_cat_name <- dummy_cats %>% 
                   gather(key = type, value=category_name, -text,na.rm = TRUE) %>%
                   select(-type) 
    
    #
    #---------------------------------------------------------------------------
    #
    #  ALTERNATIVE:  go directly from text data to key-value format with categories
    #  as values under category_name
    #  Rows are repeated if they fall into multiple categories
    #  Rows with no categories are put in category other
    #
       dummy_dat <- dummy_dat %>% mutate(text = tolower(text))
       dummy_cat_name1 <- data.frame(text = NULL, category_name = NULL)
       # loop variable is named category to avoid masking base::cat
       for (category in c("bill", "call", "location")) {
          temp <- dummy_dat %>%
                   mutate(category_name = str_match(text, pattern = category)[, 1]) %>%
                   na.omit()
          dummy_cat_name1 <- dummy_cat_name1 %>% bind_rows(temp)
       }
       dummy_cat_name1 <- left_join(dummy_dat, dummy_cat_name1, by = "text") %>%
                   mutate(category_name = ifelse(is.na(category_name), "other", category_name))
    

    The result is

     dummy_cat_name1
                                                text      category_name
                                -the bill was not generated          bill
                              -tried to raise the complaint         other
                                -the location update failed      location
                -the call drop has increased in my location          call
                -the call drop has increased in my location      location
         -nobody in the location received bill,so call asap          bill
         -nobody in the location received bill,so call asap          call
         -nobody in the location received bill,so call asap      location
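
    If one row per tweet is preferred, with the rule from the question that more than 3 categories become 'other', a minimal base R sketch could look like the following. The helper label_tweet and the max_categories argument are hypothetical names introduced here, and treating a tweet with no matching category as 'other' is an assumption carried over from the code above.

    ```r
    # Same dummy tweets as in the question.
    tweets <- data.frame(
      text = c("-the bill was not generated",
               "-tried to raise the complaint",
               "-the location update failed",
               "-the call drop has increased in my location",
               "-nobody in the location received bill,so call ASAP"),
      stringsAsFactors = FALSE
    )

    categories <- c("bill", "call", "location")

    # Hypothetical helper: returns the matching category names for one tweet,
    # or "other" when nothing matches or more than max_categories match.
    label_tweet <- function(tweet, categories, max_categories = 3) {
      is_hit <- vapply(categories, grepl, logical(1), x = tweet, fixed = TRUE)
      found <- categories[is_hit]
      if (length(found) == 0 || length(found) > max_categories) {
        return("other")
      }
      paste(found, collapse = ", ")
    }

    # One label per tweet, matched against the lower-cased text.
    tweets$category_name <- vapply(tolower(tweets$text), label_tweet, character(1),
                                   categories = categories, USE.NAMES = FALSE)
    tweets
    ```

    With only three categories defined, no tweet can exceed the threshold, so the 'more than 3' branch only comes into play once further categories are added to the categories vector.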