sentSplit() in qdap has issues when there are no endmarks


I am using the qdap package for polarity analysis. My CSV file contains a sentence with no end punctuation, "Sucks to not be removable" (no period). After I run sentSplit on the data frame, this row shows NA.

How do I add endmarks to incomplete sentences in R? Is there a way to stop sentSplit from returning NA for these rows?


Solution

  • Many of the qdap functions expect properly formatted/structured data. This generally means sentences with endmarks, and often only one sentence per row, because this is how the algorithms determine what a sentence is. If the sentences are indeed incomplete, qdap expects the pipe sign "|" to denote this. Here's an example that detects missing endmarks with the end_mark function and then pastes a | at the end:

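    library(qdap)

    dat <- DATA
    # replace the first row's text with a sentence that has no endmark
    dat[1, 4] <- "Sucks to not be removable"
    # end_mark() gives "_" where no endmark is found; mark those rows with "|"
    missing <- end_mark(dat[["state"]]) == "_"
    dat[["state"]][missing] <- paste0(dat[["state"]][missing], "|")

    sentSplit(dat, "state")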
    
    ##        person  tot sex adult code                       state
    ## 1         sam  1.1   m     0   K1  Sucks to not be removable|
    ## 2        greg  2.1   m     0   K2     No it's not, it's dumb.
    ## 3     teacher  3.1   m     1   K3          What should we do?
    ## 4         sam  4.1   m     0   K4        You liar, it stinks!
    ## 5        greg  5.1   m     0   K5     I am telling the truth!
    ## 6       sally  6.1   f     0   K6      How can we be certain?
    ## 7        greg  7.1   m     0   K7            There is no way.
    ## 8         sam  8.1   m     0   K8             I distrust you.
    ## 9       sally  9.1   f     0   K9 What are you talking about?
    ## 10 researcher 10.1   f     1  K10           Shall we move on?
    ## 11 researcher 10.2   f     1  K10                  Good then.
    ## 12       greg 11.1   m     0  K11                 I'm hungry.
    ## 13       greg 11.2   m     0  K11                  Let's eat.
    ## 14       greg 11.3   m     0  K11                You already?
    

    Incidentally, the dev version of qdap (version >= 2.1.1) contains a new set of data formatting functions, including check_text, which automatically checks for potential formatting issues and prints a report giving the location of potential problems and possible fixes.
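
If you want to see the idea behind the endmark check without qdap loaded, here is a rough base-R sketch. The helper name add_incomplete_mark is hypothetical (it is not a qdap function), and the regular expression only looks for a final ".", "?", "!" or "|", which is an assumption and simpler than what end_mark recognizes. Note that it also trims surrounding whitespace.

```r
# Hypothetical helper, not part of qdap: append "|" to any string that
# does not already end in ".", "?", "!" or "|".
add_incomplete_mark <- function(x) {
  x <- trimws(x)
  # TRUE where the text is present, non-empty, and lacks a final endmark
  needs_mark <- !is.na(x) & nzchar(x) & !grepl("[.?!|]$", x)
  x[needs_mark] <- paste0(x[needs_mark], "|")
  x
}

state <- c("Sucks to not be removable",
           "No it's not, it's dumb.",
           "What should we do?")
add_incomplete_mark(state)
## [1] "Sucks to not be removable|" "No it's not, it's dumb."
## [3] "What should we do?"
```

Missing values and empty strings are left untouched, so you can apply it to the text column before calling sentSplit.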