Search code examples
rdata-cleaning

How do you clean a column with mixed height units in r


I have a column in my dataframe from a survey that has two different units in it. I need to remove these and convert the info into a consistent unit ie a column of all cm without the unit being present.

Here is some sample data

df <- data.frame(v1 = c('100 cm', '6 foot 10', '200 cm', '5 foot 11')

I attempted to use this readr::parse_number(df$v1) but that would turn '6 foot 10' into 6. I'm not sure it's that helpful anyway because I still need to convert the heights recorded as feet and inches into cm


Solution

  • You can split the strings by foot ('6 foot 10' >> c(6, 10)), and then map through the items, parse them as numbers, and -- if they have length 2 (ie, are the result of a string split) -- multiply them by the conversion rates.

    library(tidyverse)
    df$v1 %>%
      strsplit(" foot ") %>%
      map_dbl(function(num){
        num = parse_number(num)
        if(length(num) == 2) {sum(num * c(30.48, 2.54))} else {num}})
    
    [1] 100.00 208.28 200.00 180.34
    

    Another option, but less elegant in my opinion, is to change the '6 foot 10' to '6*30.48 + 2.54*10' (multiply the #feet and #inches with their conversion rates) using stringr::str_replace_all, and then evaluate those strings as expressions with eval + parse (with a mapping function like sapply or purrr::map, because these functions are not vectorised).

    df$v1 %>%
      str_replace_all(c(" cm" = "", " foot " = "*30.48 + 2.54*")) %>%
      map_dbl(~ parse(text = .x) %>% eval())
    
    [1] 100.00 208.28 200.00 180.34