Apply readr col_spec to data.frame, independently of reading from file


I have a tibble (data.frame) that I need to apply a number of type updates to. I have a readr::col_spec object that describes the desired types, but since the data does not originate as a csv file, I cannot use read_csv(..., col_types=cspec) to apply the changes to the specified columns.

Since col_spec is a data structure designed exactly to specify desired data types, I would nevertheless like to use it directly as input to a function that applies the changes for me, rather than writing a long custom script to convert each column. See the following example:

library(tidyverse)

# Subset starwars to get sw (comparable to my input data)
sw <- starwars %>%
  select(name, height, ends_with("_color")) %>%
  slice(c(1,4,5,19))
sw
#> # A tibble: 4 × 5
#>   name           height hair_color skin_color eye_color
#>   <chr>           <int> <chr>      <chr>      <chr>    
#> 1 Luke Skywalker    172 blond      fair       blue     
#> 2 Darth Vader       202 none       white      yellow   
#> 3 Leia Organa       150 brown      light      brown    
#> 4 Yoda               66 white      green      brown

# The col_spec that I have
cspec <- cols(
  hair_color = col_factor(c("brown", "blond", "white", "none")),
  skin_color = col_factor(c( "green", "light", "fair", "white")),
  eye_color = col_factor(c("blue", "brown", "yellow"))
)

# I would like to apply the col_spec directly to sw

# A not so great workaround is to use a tempfile
tf <- tempfile()
sw %>% write_csv(tf)
sw_fct <- read_csv(tf, col_types=cspec)

# This is more or less the result I am after:
# But note how info on other columns (height) is lost in the roundtrip
sw_fct
#> # A tibble: 4 × 5
#>   name           height hair_color skin_color eye_color
#>   <chr>           <dbl> <fct>      <fct>      <fct>    
#> 1 Luke Skywalker    172 blond      fair       blue     
#> 2 Darth Vader       202 none       white      yellow   
#> 3 Leia Organa       150 brown      light      brown    
#> 4 Yoda               66 white      green      brown

Solution

  • We can do this by looping over the collectors stored in cspec$cols and applying parse_factor() to the matching column of sw:

    library(readr)
    library(purrr)
    sw[names(cspec$cols)] <- imap(cspec$cols, ~ parse_factor(sw[[.y]],
         levels = .x$levels, ordered = .x$ordered, include_na = .x$include_na))
    

    Checking the output:

    > sw
    # A tibble: 4 × 5
      name           height hair_color skin_color eye_color
      <chr>           <int> <fct>      <fct>      <fct>    
    1 Luke Skywalker    172 blond      fair       blue     
    2 Darth Vader       202 none       white      yellow   
    3 Leia Organa       150 brown      light      brown    
    4 Yoda               66 white      green      brown    
    
    > str(sw)
    tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
     $ name      : chr [1:4] "Luke Skywalker" "Darth Vader" "Leia Organa" "Yoda"
     $ height    : int [1:4] 172 202 150 66
     $ hair_color: Factor w/ 4 levels "brown","blond",..: 2 4 1 3
     $ skin_color: Factor w/ 4 levels "green","light",..: 3 4 2 1
     $ eye_color : Factor w/ 3 levels "blue","brown",..: 1 3 2 2
    

    If we also need the 'spec' attribute on the result, assign it directly:

    attr(sw, "spec") <- cspec
    

    Checking the structure again:

    > str(sw)
    tibble [4 × 5] (S3: tbl_df/tbl/data.frame)
     $ name      : chr [1:4] "Luke Skywalker" "Darth Vader" "Leia Organa" "Yoda"
     $ height    : int [1:4] 172 202 150 66
     $ hair_color: Factor w/ 4 levels "brown","blond",..: 2 4 1 3
     $ skin_color: Factor w/ 4 levels "green","light",..: 3 4 2 1
     $ eye_color : Factor w/ 3 levels "blue","brown",..: 1 3 2 2
     - attr(*, "spec")=
      .. cols(
      ..   hair_color = col_factor(levels = c("brown", "blond", "white", "none"), ordered = FALSE, include_na = FALSE),
      ..   skin_color = col_factor(levels = c("green", "light", "fair", "white"), ordered = FALSE, include_na = FALSE),
      ..   eye_color = col_factor(levels = c("blue", "brown", "yellow"), ordered = FALSE, include_na = FALSE)
      .. )
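
The loop above calls parse_factor(), so it only covers specs made entirely of col_factor() entries. A minimal sketch of a more general version follows, assuming every entry in the spec is a collector that readr's parse_vector() accepts. The helper name apply_col_spec and the small stand-in data frame are made up for illustration and are not part of readr. Because parse_vector() only takes character input, each column is coerced with as.character() first, and the default na and locale settings are used. Entries such as col_skip() or col_guess() are not handled here.

```r
library(readr)

# Hypothetical helper (not part of readr): run each collector in `spec`
# over the column of the same name in `df`. Columns that are not named
# in the spec are left untouched.
apply_col_spec <- function(df, spec) {
  for (nm in intersect(names(spec$cols), names(df))) {
    # parse_vector() requires character input, so coerce first
    df[[nm]] <- parse_vector(as.character(df[[nm]]), spec$cols[[nm]])
  }
  df
}

# Small stand-in for the question's `sw` data
sw <- data.frame(
  name = c("Luke Skywalker", "Yoda"),
  height = c(172L, 66L),
  eye_color = c("blue", "brown"),
  stringsAsFactors = FALSE
)

# A spec mixing a numeric and a factor collector
cspec <- cols(
  height = col_double(),
  eye_color = col_factor(c("blue", "brown", "yellow"))
)

sw_typed <- apply_col_spec(sw, cspec)
str(sw_typed)
```

Since only the columns named in the spec are rewritten, the types of the remaining columns survive, unlike in the tempfile roundtrip from the question.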