Search code examples
rcharacterstripqdap

Strip function in qdap package in R - Removes slash incorrectly


I have a text file, which is several hundred rows long. I am trying to remove all of the [edit:add] punctuation characters from it except the "/" characters. I am currently using the strip function in the qdap package.

Here is a sample data set:

htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/", 
        "{fonttblf0fswissfcharset0 helvetica",
        "margl1440margr1440vieww9000viewh8400viewkind0")

Here is the code:

strip(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

The only problem with this beautiful function is that it removes the "/" characters. If I try to remove all characters except the "{" character it works:

strip(htxt, char.keep = "{", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

Has anyone experienced the same problem?


Solution

  • For whatever reason it seems the qdap:::strip always strips "/" out of character vectors. This is in the source code towards the end of the function:

    x <- clean(gsub("/", " ", gsub("-", " ", x)))
    

    This is run before the actual function which does the stripping which is defined in the body of the function strip....

    So just replace the function with your own version:

    strip.new <- function (x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = TRUE, 
        lower.case = TRUE) 
    {
        strp <- function(x, digit.remove, apostrophe.remove, char.keep, 
            lower.case) {
            if (!is.null(char.keep)) {
                x2 <- Trim(gsub(paste0(".*?($|'|", paste(paste0("\\", 
                    char.keep), collapse = "|"), "|[^[:punct:]]).*?"), 
                    "\\1", as.character(x)))
            }
            else {
                x2 <- Trim(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", 
                    as.character(x)))
            }
            if (lower.case) {
                x2 <- tolower(x2)
            }
            if (apostrophe.remove) {
                x2 <- gsub("'", "", x2)
            }
            ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", 
                x2), x2)
        }
        unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
            apostrophe.remove = apostrophe.remove, char.keep = char.keep, 
            lower.case = lower.case))))
    }
    
    strip.new(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)
    
    #[1] "rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/"
    #[2] "fonttblf0fswissfcharset0 helvetica"            
    #[3] "margl1440margr1440vieww9000viewh8400viewkind0" 
    

    The package author is pretty active on this site so he can probably clear up why strip does this by default.