Search code examples
rcurrencydata-cleaning

R - identify which columns contain currency data $


I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data all currency columns are identified as factor and I need to convert them to numeric. The dataset it too wide to manually identify the columns. I am trying to find a programmatic way to identify if a column contains currency data (ex. starts with '$') and then pass that list of columns to be cleaned.

name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)

clean <- function(ttt){
as.numeric(gsub('[^a-zA-z0-9.]','', ttt))
}
sapply(emp_data, clean)

The issue in this example is that this sapply works on all columns resulting in the name column being replaced with NA. I need a way to programmatically identify just the columns that the clean function needs to be applied to.. in this example salary.


Solution

  • Using dplyr and stringr packages, you can use mutate_if to identify columns that have any string starting with a $ and then change the accordingly.

    library(dplyr)
    library(stringr)
    
    emp_data %>%
      mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
                ~as.numeric(str_replace_all(., '[$,]', '')))