Search code examples
rdplyrencodingtidyverse

Change Encoding of tibble columns in tidyverse


I have data from a MySQL databases which returns latin1 encoded with hex in the following format:

> data
# A tibble: 52 × 4
      id code   swe_name         eng_name          
   <int> <chr>  <chr>            <chr>             
 1     0 ""     ""               ""                
 2     1 "M"    "muskel"         "muscle"          
 3     2 "L"    "lever"          "liver"           
 4     3 "N"    "njure"          "kidney"          
 5     4 "Fj"   "fj\xe4der"      "feather"         
 6     5 "Hj"   "hj\xe4rna"      "brain"           
 7     6 "\xd6" "\xf6vrigt"      "other"           
 8     7 "Ind"  "hela individen" "whole individual"
 9     8 "F"    "fett/talg"      "fat"             
10     9 "DF"   "sp\xe4ck"       "lard"            
# ℹ 42 more rows
# ℹ Use `print(n = ...)` to see more rows

I need to set the Encoding of the columns to "latin1". I can do it 'manually' with:

for (c in colnames(data)[vapply(data, is.character, TRUE)]) {
    Encoding(data[, c, drop = TRUE]) <- "latin1" 
}

This however is hard to read and feels clunky. I thought of using dplyr, where the columns can be selected nicely using

data |> select_if(is.character)

or

data |> mutate(across(where(is.character), ...))

and I've think it might be possible to use mutate and across, but the Encoding(x) <- "latin1" isn't really a function that can be applied in place, and function(x) { Encoding(x) <- "latin1" } doesn't work.

If possible (or preferable) how can this be done using dplyr and tidyverse? Since the column-selection is so convenient.


Solution

  • Richie is right, it's better to read in the data correctly rather than cleaning after the fact. With that said, if you are forced to clean, here's a couple of ways of doing it:

    library(dplyr)
    
    mutate(df, across(where(is.character), ~iconv(., "latin1", "UTF-8")))
    mutate(df, across(where(is.character), ~stringi::stri_encode(.x, "latin1", "UTF-8")))