Search code examples
rstringgsubparenthesesdollar-sign

Removing parenthesis & dollar sign using gsub() in R?


I've imported a csv file to r. Few of my columns contains currency data (ex. starts with '$'). To remove the '$' and convert the into numeric type, I've used gsub() function and it works fine. Sample code is:

strip_dollars = function(x) { as.numeric(gsub("[\\$, ]", "", x)) }
lapply(data[ , c("x4", "x5", "x7", "x8")] , strip_dollars) 

But in one particular currency column, I've few entries incuding parenthesis like ($2,345), $3,500, $5,600, ($3,234), ..... Hence, for this column the above code doesn't work. Please help me out with removing parenthesis & dollar sign from these entries.

Also, to identify columns containing currency data, I tried a lot including solutions given in R - identify which columns contain currency data $ but it doesn't work. Please suggest an optimal way of doing so.


Solution

  • I think the easiest way is to step through what is replaced. I'm inferring that you do not want to lose the negative-ness suggested by the parens, so we'll do two steps:

    s <- c("($2,345)", "$3,500", "$5,600", "($3,234)")
    gsub("[$),]", "", s)
    # [1] "(2345" "3500"  "5600"  "(3234"
    

    This removes most of the junk (that we did not want/need to keep), now let's deal with the leading left-paren:

    gsub("^\\s*[(]", "-", gsub("[$),]", "", s))
    # [1] "-2345" "3500"  "5600"  "-3234"
    

    From here, if needed, you can convert to numeric:

    as.numeric(gsub("^\\s*[(]", "-", gsub("[$),]", "", s)))
    # [1] -2345  3500  5600 -3234
    

    This is not very robust to mal-formed strings. For instance, though I look for (and remove) leading space for the left-paren, anything else there will be a problem.