I have data from a MySQL databases which returns latin1 encoded with hex in the following format:
> data
# A tibble: 52 × 4
id code swe_name eng_name
<int> <chr> <chr> <chr>
1 0 "" "" ""
2 1 "M" "muskel" "muscle"
3 2 "L" "lever" "liver"
4 3 "N" "njure" "kidney"
5 4 "Fj" "fj\xe4der" "feather"
6 5 "Hj" "hj\xe4rna" "brain"
7 6 "\xd6" "\xf6vrigt" "other"
8 7 "Ind" "hela individen" "whole individual"
9 8 "F" "fett/talg" "fat"
10 9 "DF" "sp\xe4ck" "lard"
# ℹ 42 more rows
# ℹ Use `print(n = ...)` to see more rows
I need to set the Encoding of the columns to "latin1". I can do it 'manually' with:
for (c in colnames(data)[vapply(data, is.character, TRUE)]) {
Encoding(data[, c, drop = TRUE]) <- "latin1"
}
This however is hard to read and feels clunky. I thought of using dplyr, where the columns can be selected nicely using
data |> select_if(is.character)
or
data |> mutate(across(where(is.character), ...))
and I've think it might be possible to use mutate
and across
, but the Encoding(x) <- "latin1"
isn't really a function that can be applied in place, and function(x) { Encoding(x) <- "latin1" }
doesn't work.
If possible (or preferable) how can this be done using dplyr and tidyverse? Since the column-selection is so convenient.
Richie is right, it's better to read in the data correctly rather than cleaning after the fact. With that said, if you are forced to clean, here's a couple of ways of doing it:
library(dplyr)
mutate(df, across(where(is.character), ~iconv(., "latin1", "UTF-8")))
mutate(df, across(where(is.character), ~stringi::stri_encode(.x, "latin1", "UTF-8")))