
Keeping the apostrophe when using the textcnt function from the tau package in R


The textcnt function in R's tau package has a split argument, and its default value is split = "[[:space:][:punct:][:digit:]]+". This argument also uses the apostrophe ' to split text into words, and I don't want that. How can I modify the argument so that it doesn't split words on the apostrophe?

This code:

    library(tau)
    text <- "I don't want the function to use the ' to split"
    textcnt(text, split = "[[:space:][:punct:][:digit:]]+", method = "string", n = 1L)

produces this output:

 don function        i    split        t      the       to      use     want 
   1        1        1        1        1        2        2        1        1 

Instead of counting don and t once each, I would like to keep don't as one word.

I tried using str_replace_all from stringr to remove the punctuation beforehand and then omitting the [:punct:] part of the split argument in textcnt, but then symbols such as &, > or " are no longer used to split. I also tried modifying the split argument directly, but then it either doesn't split the sentence at all or it keeps the symbols.

Thank you


Solution

  • With PCRE-based functions you need to use

    split = "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"
    

    Here,

    • (?: - start of a non-capturing group:
    • (?!') - a negative lookahead that fails the match if the next char is a ' char
    • [[:space:][:punct:][:digit:]] - matches a whitespace, punctuation or digit char
    • )+ - end of the group, matched one or more times (consecutively)
    • | - or
    • '\B - a ' char that is followed by either the end of string or a non-word char
    • | - or
    • \B' - a ' char that is preceded by either the start of string or a non-word char.

    With stringr functions, you can use

    split = "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"
    

    Here, [[:space:][:punct:][:digit:]--[']] matches all characters matched by [[:space:][:punct:][:digit:]] except the ' chars.

    The ICU regex flavor used by stringr supports character class subtraction using this notation.
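
The PCRE pattern can be checked without tau. This is a minimal sketch that uses base R's strsplit with perl = TRUE as a stand-in for textcnt. The names split_pattern, tokens and counts are illustrative only, and the explicit tolower call and the removal of empty strings are assumptions made to imitate the word counts shown in the question.

```r
# Sketch: check the PCRE split pattern with base R only (no tau needed).
text <- "I don't want the function to use the ' to split"

# Pattern from the answer: runs of space/punct/digit chars other than ',
# plus any ' that lacks a word character on at least one side.
split_pattern <- "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

# perl = TRUE selects the PCRE engine, which the lookahead (?!') requires.
tokens <- strsplit(tolower(text), split_pattern, perl = TRUE)[[1]]

# Adjacent separators (here " ", "'", " ") leave empty strings; drop them.
tokens <- tokens[nzchar(tokens)]

print(tokens)

# Count each word, similar in spirit to textcnt(method = "string", n = 1L).
counts <- table(tokens)
print(counts)
```

Here don't stays a single token, while the standalone ' is treated as a separator. If textcnt applies split through PCRE, as the answer implies, the same string should work as textcnt(text, split = split_pattern, method = "string", n = 1L).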