I have a problem scoring my data. Below is the data set; text holds the tweets on which I want to do text mining and sentiment analysis.
| text | call | bills | location |
|---|---|---|---|
| -the bill was not generated | 0 | bill | 0 |
| -tried to raise the complaint | 0 | 0 | 0 |
| -the location update failed | 0 | 0 | location |
| -the call drop has increased in my location | call | 0 | location |
| -nobody in the location received bill,so call ASAP | call | bill | location |
This is dummy data, where text is the column I am mining. I used the grepl function in R to create the category columns (e.g. bills, call, location): if a row's text contains "bill", the bills column holds bill for that row, and likewise for all the other categories.
vdftweet$app = ifelse(grepl('app',tolower(vdftweet$text)),'app',0)
table(vdftweet$app)
What I cannot work out is how to create a new column, "category_name", in which each row gives the names of the categories that tweet falls into. If a tweet matches more than 3 categories, it should be marked 'other'; otherwise it should list the category names.
There are a couple of ways you could do this using the tidyverse packages. In the first method, mutate is used to add the category names as columns to the text data.frame, similar to what you have. gather is then used to transform that to key-value format, in which the categories are values in the category_name column.
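As a side note, in tidyr 1.0.0 and later gather is superseded by pivot_longer. The sketch below shows the same wide-to-long step with pivot_longer; it is not part of the original approach, and the small wide data frame is a hypothetical stand-in for the frame the first method builds.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide frame: one column per category, NA where there is no match
wide <- data.frame(
  text = c("-the bill was not generated",
           "-the call drop has increased in my location"),
  bill = c("bill", NA),
  call = c(NA, "call"),
  location = c(NA, "location"),
  stringsAsFactors = FALSE
)

# Reshape to key-value form, dropping NA cells, then discard the key column
long <- wide %>%
  pivot_longer(cols = -text, names_to = "type", values_to = "category_name",
               values_drop_na = TRUE) %>%
  select(-type)
```

Note that pivot_longer emits rows tweet by tweet, whereas gather emits them column by column, so the row order of the two results can differ.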
The alternative approach goes directly to the key-value format, in which categories are values in the category_name column. Rows are repeated if they fall into multiple categories. If you don't need the first form with the categories as column names, the alternative approach is more flexible for adding new categories and requires less processing.
In both methods, str_match applies the regular expression that matches a category against the text. The pattern here is trivial, but a more complex pattern could be used if needed.
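To illustrate the remark about more complex patterns, here is a minimal sketch that uses word boundaries and an optional plural, so that "bill" is not also matched inside a longer word. The sample strings and the stricter pattern are assumptions made up for illustration; they are not part of the dummy data.

```r
library(stringr)

# Hypothetical sample strings, chosen to show where the two patterns differ
samples <- c("the bill was not generated",
             "two bills arrived late",
             "saw a billboard near my location")

# Trivial pattern: matches "bill" anywhere, including inside "billboard"
simple <- str_detect(samples, "bill")

# Stricter pattern: only the whole word "bill" or "bills"
strict <- str_detect(samples, "\\bbills?\\b")

data.frame(text = samples, simple = simple, strict = strict)
```

The same stricter pattern could be passed to str_match in place of the literal "bill".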
The code follows:
library(tidyverse)
#
# read dummy data into data frame
#
dummy_dat <- read.table(header = TRUE, stringsAsFactors = FALSE,
                        strip.white = TRUE, sep = "\n",
                        text = "text
-the bill was not generated
-tried to raise the complaint
-the location update failed
-the call drop has increased in my location
-nobody in the location received bill,so call ASAP")
#
# form data frame with categories as columns
#
dummy_cats <- dummy_dat %>%
  mutate(text = tolower(text),
         # str_match returns a matrix; [, 1] keeps the full match as a plain
         # character vector. Use text (not .$text) so the lowercased value is matched.
         bill = str_match(text, pattern = "bill")[, 1],
         call = str_match(text, pattern = "call")[, 1],
         location = str_match(text, pattern = "location")[, 1],
         other = ifelse(is.na(bill) & is.na(call) & is.na(location), "other", NA))
#
# convert categories as columns to key-value format
# with categories as values in category_name column
#
dummy_cat_name <- dummy_cats %>%
  gather(key = type, value = category_name, -text, na.rm = TRUE) %>%
  select(-type)
#
#---------------------------------------------------------------------------
#
# ALTERNATIVE: go directly from text data to key-value format with categories
# as values under category_name
# Rows are repeated if they fall into multiple categories
# Rows with no categories are put in category other
#
dummy_dat <- dummy_dat %>% mutate(text = tolower(text))
dummy_cat_name1 <- data.frame(text = NULL, category_name = NULL)
# cat_name is used as the loop variable to avoid shadowing base::cat()
for (cat_name in c("bill", "call", "location")) {
  # keep only the rows whose text matches the current category
  temp <- dummy_dat %>%
    mutate(category_name = str_match(text, pattern = cat_name)[, 1]) %>%
    na.omit()
  dummy_cat_name1 <- dummy_cat_name1 %>% bind_rows(temp)
}
# rows with no match come back as NA from the join and are labelled "other"
dummy_cat_name1 <- left_join(dummy_dat, dummy_cat_name1, by = "text") %>%
  mutate(category_name = ifelse(is.na(category_name), "other", category_name))
The result is
dummy_cat_name1
text                                                category_name
-the bill was not generated                         bill
-tried to raise the complaint                       other
-the location update failed                         location
-the call drop has increased in my location         call
-the call drop has increased in my location         location
-nobody in the location received bill,so call asap  bill
-nobody in the location received bill,so call asap  call
-nobody in the location received bill,so call asap  location
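If you need exactly one row per tweet, as the question describes, the following is a minimal sketch rather than a definitive solution. It collapses the matched categories into a single category_name string and marks the tweet 'other' when nothing matches or when more than max_cats categories match. The names label_tweet, categories and max_cats are hypothetical helpers introduced here, and treating unmatched tweets as 'other' is an assumption carried over from the code above.

```r
library(dplyr)
library(stringr)

# Same dummy tweets as in the question
tweets <- data.frame(
  text = c("-the bill was not generated",
           "-tried to raise the complaint",
           "-the location update failed",
           "-the call drop has increased in my location",
           "-nobody in the location received bill,so call ASAP"),
  stringsAsFactors = FALSE
)

categories <- c("bill", "call", "location")
max_cats <- 3  # assumption: more than this many matches means "other"

# Hypothetical helper: returns the label for a single tweet
label_tweet <- function(txt) {
  # one TRUE/FALSE per category pattern, tested against the lowercased tweet
  hits <- categories[str_detect(tolower(txt), categories)]
  if (length(hits) == 0 || length(hits) > max_cats) {
    "other"
  } else {
    paste(hits, collapse = ", ")
  }
}

tweets <- tweets %>%
  mutate(category_name = vapply(text, label_tweet, character(1),
                                USE.NAMES = FALSE))
```

With only three categories defined, the "more than 3" branch never fires on this data; it becomes relevant once more entries are added to categories.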