This is more of an open ended question since I'm not looking for a one particular code response. I have an example dataset, which is just a manually modified snippet of the set I am using. What I want to know is, for you more experienced R data analysts, how would you approach cleaning this data? Prioritising speed and efficiency (in terms of real time and computing time), what steps would you take and why, and which R packages are most useful? With enough googling and effort I have figured it out myself, but I'm certain my solutions are inelegant and inefficient. Do you have any little known tricks or functions that can fix multiple problems at once?
Imagine this is sales data pulled from a website with discounts and number of product ratings. Initially, every column is of character type: Note - I made it annoyingly messy on purpose
price | discounted_price | discount_percent | rating_count |
---|---|---|---|
USD499 | USD176.63 | 65% | 15,188 |
USD299 | USD399 | 25% | 34,899 |
19900 $ | 11,499 $ | 42 | 4703 |
$475 | $309 | 35% | 4,26,973 |
$ 22,900 | $ 13490 | 41% | 16,299 |
699.99USD | 263.28USD | 62 | 450 |
$899 | $349 | 61% | 149 |
USD1999 | USD1,299 | 35 | 590 |
The clean table should look like this:
price | discounted_price | discount_percent | rating_count |
---|---|---|---|
$499.00 | $176.63 | 65% | 15,188 |
$299.00 | $399.00 | 25% | 34,899 |
$19,900.00 | $11,499.00 | 42% | 4,703 |
$475.00 | $309.00 | 35% | 426,973 |
$22,900.00 | $13,490.00 | 41% | 16,299 |
$699.99 | $263.28 | 62% | 450 |
$899.00 | $349.00 | 61% | 149 |
$1,999.00 | $1,299.00 | 35% | 590 |
Tasks to perform:
Edit: Removed task to make all columns numeric as it didn't make sense
Discarding your "all columns numeric" constraint, since your expected output is all string, this is easily addressed with scales
in a dplyr
pipe:
library(dplyr)
library(scales) # dollar, percent, comma
quux %>%
mutate(
across(everything(), ~ as.numeric(gsub("[^-0-9.]", "", .))),
discount_percent = discount_percent/100,
across(c(price, discounted_price), ~ dollar(.)),
discount_percent = percent(discount_percent, accuracy=1),
rating_count = comma(rating_count)
)
# price discounted_price discount_percent rating_count
# 1 $499.00 $176.63 65% 15,188
# 2 $299.00 $399.00 25% 34,899
# 3 $19,900.00 $11,499.00 42% 4,703
# 4 $475.00 $309.00 35% 426,973
# 5 $22,900.00 $13,490.00 41% 16,299
# 6 $699.99 $263.28 62% 450
# 7 $899.00 $349.00 61% 149
# 8 $1,999.00 $1,299.00 35% 590
If you want to do any analysis or numeric processing, then separate this into two stages:
quux %>%
mutate(
across(everything(), ~ as.numeric(gsub("[^-0-9.]", "", .))),
discount_percent = discount_percent/100
) %>%
# do your analysis/calculations here
mutate
across(c(price, discounted_price), ~ dollar(.)),
discount_percent = percent(discount_percent, accuracy=1),
rating_count = comma(rating_count)
)
If you have any other columns that might already be numeric, you may want to replace everything()
with where(is.character)
. (You can use it anyway if you choose.)
This can also easily be done with base R or data.table
. The scales
package does make most of the formatting much easier.
You can generalize the formatting a bit if you have multiple columns not present in this example, using
quux %>%
mutate(
across(everything(), ~ as.numeric(gsub("[^-0-9.]", "", .))),
across(ends_with("percent"), ~ ./100)
) %>%
# do your analysis/calculations here
mutate(
across(ends_with("price"), ~ dollar(.)),
across(ends_with("percent"), ~ percent(., accuracy=1)),
across(ends_with("count"), ~ comma(.))
)
Data
quux <- structure(list(price = c("USD499", "USD299", "19900 $", "$475", "$ 22,900", "699.99USD", "$899", "USD1999"), discounted_price = c("USD176.63", "USD399", "11,499 $", "$309", "$ 13490", "263.28USD", "$349", "USD1,299"), discount_percent = c("65%", "25%", "42", "35%", "41%", "62", "61%", "35"), rating_count = c("15,188", "34,899", "4703", "4,26,973", "16,299", "450", "149", "590")), class = "data.frame", row.names = c(NA, -8L))