r function nlp

R Function to Return a New Column in a Dataset


I'm currently working with a dataset of utterances from different speakers and am trying to extract the number of words in each utterance. I am also trying to count the number of backchannels (utterances with three or fewer words). These metrics will be used for further analysis of the dataset. Please see a slice of the data below.

speaker <- c("P6", "P4", "P5", "P6", "P6")
utterance <- c("Alright", "So this is a social talk right? So we’re only supposed to talk about work only", "yeah", "And that’s the thing, so it’s not clear to me we need to work or just to be social.", "But a bit and a bit")
df <- data.frame(speaker, utterance)

These are the different functions I've tried out. The issue is that I would like to store the results in a new column in the same dataframe, but this is something I haven't been able to do yet (I'm a beginner with R). I can see with the code below that the first function works as intended, but I am having some issues with the second one. Ideally I would like both functions to just accept the generic column name rather than the dataframe.

#function for utterance length
utterance_length <- function(df){
  df <- df %>%
  mutate(utterance_length = str_count(df$utterance,"\\S+"))
  return(df)}


#function for backchannelling 
backchannelling <- function(df){
  df$backchannelling <- ifelse(df$utterance_length > 3, 0, 1)
  return(df)
}

How can I: 1) save the new utterance_length column to the data frame (the same goes for the backchannelling function); 2) pass only column names to the functions rather than the data frame?


Solution

  • This should solve your problems:

    library(dplyr)
    library(stringr)
    
    speaker <- c("P6", "P4", "P5", "P6", "P6")
    utterance <- c("Alright", "So this is a social talk right? So we’re only supposed to talk about work only", "yeah", "And that’s the thing, so it’s not clear to me we need to work or just to be social.", "But a bit and a bit")
    df <- data.frame(speaker, utterance)
    
    # Function for utterance length.
    utterance_length <- function(dta, column_name)
    {
      # Inputs are a data frame and the name of the column storing utterances.
      # Output is number of words for each utterance (vector).
      
      # [[ ]] extracts the column as a plain vector, for data frames and tibbles alike.
      utterance_length <- str_count(dta[[column_name]], "\\S+")
      return(utterance_length)
    }
    
    # Function for backchannelling.
    backchannelling <- function(dta, column_name)
    {
      # Inputs are a data frame and the name of the column storing utterances.
      # Output is 1 for backchannels, 0 otherwise (vector).
      
      backchannelling <- ifelse(str_count(dta[[column_name]], "\\S+") > 3, 0, 1)
      return(backchannelling)
    }
    
    # Creating new columns.
    df$lengths <- utterance_length(df, "utterance")
    df$backch <- backchannelling(df, "utterance")
    

    Both functions now take two inputs: the data frame and the name of the column storing the utterances. In principle you could drop the data frame argument, but that would cost generality: every time you use a different data frame (or rename the existing one), both functions would have to be modified accordingly. Results:

    df
    
    #   speaker                                                                           utterance lengths backch
    # 1      P6                                                                             Alright       1      1
    # 2      P4      So this is a social talk right? So we’re only supposed to talk about work only      16      0
    # 3      P5                                                                                yeah       1      1
    # 4      P6 And that’s the thing, so it’s not clear to me we need to work or just to be social.      19      0
    # 5      P6                                                                 But a bit and a bit       6      0
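
A possible variation, for the second part of the question: the functions can take the column itself (a character vector) instead of a data frame plus a column name, and be called inside mutate(). This is only a sketch; the helper names count_words and is_backchannel and the max_words argument are made up for illustration and are not part of the answer above.

```r
library(dplyr)
library(stringr)

# Hypothetical helpers (names are illustrative, not from the original answer).
# Each one takes a character vector, so it works on any column of any data frame.
count_words <- function(x) {
  # Count runs of non-whitespace characters, i.e. whitespace-separated words.
  str_count(x, "\\S+")
}

is_backchannel <- function(x, max_words = 3) {
  # 1 if the utterance has max_words or fewer words, 0 otherwise.
  as.integer(count_words(x) <= max_words)
}

# A small slice of the example data.
df <- data.frame(
  speaker = c("P6", "P5", "P6"),
  utterance = c("Alright", "yeah", "But a bit and a bit")
)

# mutate() returns a new data frame, so the result has to be
# assigned back to df for the new columns to be kept.
df <- df %>%
  mutate(
    lengths = count_words(utterance),
    backch = is_backchannel(utterance)
  )
```

Because the helpers only see a vector, the data frame and column are named at the call site (inside mutate()), so nothing in the functions needs to change when the data frame or column name does.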