Search code examples
rdata-cleaningcurrency

What is the most efficient way to clean currency data in R?


This is more of an open ended question since I'm not looking for a one particular code response. I have an example dataset, which is just a manually modified snippet of the set I am using. What I want to know is, for you more experienced R data analysts, how would you approach cleaning this data? Prioritising speed and efficiency (in terms of real time and computing time), what steps would you take and why, and which R packages are most useful? With enough googling and effort I have figured it out myself, but I'm certain my solutions are inelegant and inefficient. Do you have any little known tricks or functions that can fix multiple problems at once?

Imagine this is sales data pulled from a website with discounts and number of product ratings. Initially, every column is of character type: Note - I made it annoyingly messy on purpose

price discounted_price discount_percent rating_count
USD499 USD176.63 65% 15,188
USD299 USD399 25% 34,899
19900 $ 11,499 $ 42 4703
$475 $309 35% 4,26,973
$ 22,900 $ 13490 41% 16,299
699.99USD 263.28USD 62 450
$899 $349 61% 149
USD1999 USD1,299 35 590

The clean table should look like this:

price discounted_price discount_percent rating_count
$499.00 $176.63 65% 15,188
$299.00 $399.00 25% 34,899
$19,900.00 $11,499.00 42% 4,703
$475.00 $309.00 35% 426,973
$22,900.00 $13,490.00 41% 16,299
$699.99 $263.28 62% 450
$899.00 $349.00 61% 149
$1,999.00 $1,299.00 35% 590

Tasks to perform:

  • Make all price values in the format 0,000.00 with a comma separator for bigger values
  • Make all price values have a $ prefix
  • Make percentage column in the format xx%
  • Make rating_count in the format 0,000

Edit: Removed task to make all columns numeric as it didn't make sense


Solution

  • Discarding your "all columns numeric" constraint, since your expected output is all string, this is easily addressed with scales in a dplyr pipe:

    library(dplyr)
    library(scales) # dollar, percent, comma
    quux %>%
      mutate(
        across(everything(), ~ as.numeric(gsub("[^-0-9.]", "", .))),
        discount_percent = discount_percent/100,
        across(c(price, discounted_price), ~ dollar(.)),
        discount_percent = percent(discount_percent, accuracy=1),
        rating_count = comma(rating_count)
      )
    #        price discounted_price discount_percent rating_count
    # 1    $499.00          $176.63              65%       15,188
    # 2    $299.00          $399.00              25%       34,899
    # 3 $19,900.00       $11,499.00              42%        4,703
    # 4    $475.00          $309.00              35%      426,973
    # 5 $22,900.00       $13,490.00              41%       16,299
    # 6    $699.99          $263.28              62%          450
    # 7    $899.00          $349.00              61%          149
    # 8  $1,999.00        $1,299.00              35%          590
    

    If you want to do any analysis or numeric processing, then separate this into two stages:

    quux %>%
      mutate(
        across(everything(), ~ as.numeric(gsub("[^-0-9.]", "", .))),
        discount_percent = discount_percent/100
      ) %>%
      # do your analysis/calculations here
      mutate
        across(c(price, discounted_price), ~ dollar(.)),
        discount_percent = percent(discount_percent, accuracy=1),
        rating_count = comma(rating_count)
      )
    

    If you have any other columns that might already be numeric, you may want to replace everything() with where(is.character). (You can use it anyway if you choose.)

    This can also easily be done with base R or data.table. The scales package does make most of the formatting much easier.

    You can generalize the formatting a bit if you have multiple columns not present in this example, using

    quux %>%
      mutate(
        across(everything(), ~ as.numeric(gsub("[^-0-9.]", "", .))), 
        across(ends_with("percent"), ~ ./100)
      ) %>%
      # do your analysis/calculations here
      mutate(
        across(ends_with("price"), ~ dollar(.)), 
        across(ends_with("percent"), ~ percent(., accuracy=1)), 
        across(ends_with("count"), ~ comma(.))
      )
    

    Data

    quux <- structure(list(price = c("USD499", "USD299", "19900 $", "$475", "$ 22,900", "699.99USD", "$899", "USD1999"), discounted_price = c("USD176.63", "USD399", "11,499 $", "$309", "$ 13490", "263.28USD", "$349", "USD1,299"), discount_percent = c("65%", "25%", "42", "35%", "41%", "62", "61%", "35"), rating_count = c("15,188", "34,899", "4703", "4,26,973", "16,299", "450", "149", "590")), class = "data.frame", row.names = c(NA, -8L))