Search code examples
rdplyrtype-conversiondata-cleaningreadr

Convert numeric column to integer if possible, otherwise keep as numeric


Background

I was importing and cleaning a dataset (head included below) using read_csv() (from ), and noticed various columns which probably should be integer columns were left as numeric columns by readr.

Question

What is an elegant way of converting the numeric columns to integer columns if it is possible (i.e. if there would be no rounding/loss in accuracy), and if not, leaving them as numeric?

For the example dataset, it would mean converting the columns "Execution", "TDCJ Number", "Age at Execution", and "Weight".

It doesn't need to be a tidyverse solution, base R or any package(s) are fine too.

Research done:

I tried searching Google, and Stack Overflow, without any luck. Honestly, I'm surprised this question hasn't been asked before! read.csv() converts to integer, but doesn't convert dates automatically, like read_csv().

Data:

# the entire dataset can be downloaded at https://archive.org/download/tx_deathrow_full/tx_deathrow_full.csv if you wish. The head is below

library(tidyverse)

tx_deathrow <- structure(list(Execution = c(553, 552, 551, 550, 549, 548), `Date of Birth` = structure(c(5014, 
-6701, 4110, 6302, 3737, -5266), class = "Date"), `Date of Offence` = structure(c(12743, 
3433, 12389, 13975, 13039, 11444), class = "Date"), `Highest Education Level` = c(9, 
12, 10, 11, 12, 12), `Last Name` = c("Young", "Bible", "Castillo", 
"Davila", "Rodriguez III", "Battaglia"), `First Name` = c("Christopher Anthony", 
"Danny Paul", "Juan Edward", "Erick Daniel", "Rosendo", "John David"
), `TDCJ
Number` = c(999508, 999455, 999502, 999545, 999534, 
999412), `Age at Execution` = c(34, 66, 37, 31, 38, 62), `Date Received` = structure(c(13238, 
12250, 13053, 14302, 14013, 11808), class = "Date"), `Execution Date` = structure(c(17729, 
17709, 17667, 17646, 17617, 17563), class = "Date"), Race = c("Black", 
"White", "Hispanic", "Black", "Hispanic", "White"), County = c("Bexar", 
"Harris", "Bexar", "Tarrant", "Lubbock", "Dallas"), `Eye Color` = c("Brown", 
"Blue", "Brown", "Brown", "Brown", "Green"), Weight = c(216, 
194, 180, 161, 198, 188), Height = c("6' 1\"", "5' 7\"", "5' 11\"", 
"5' 11\"", "5' 8\"", "6' 0\""), `Native County` = c("Bexar", 
"Brazoria", "Bexar", "Tarrant", "Wichita", "Dallas"), `Native State` = c("Texas", 
"Texas", "Texas", "Texas", "Texas", "Texas"), `Last Statement` = c("l want to make sure the Patel family knows I love them like they love me. Make sure the kids in the world know I'm being executed and those kids I've been mentoring keep this fight going. I'm good Warden.", 
NA, "To everyone that has been there for me you know who you are. Love y'all. See y'all on the other side.That's it.", 
"Yes, I would like to say nephew it burns huh. You know I might have lost the fight but I'm still a soldier. I still love you all. To my supporters and family y'all hold it down. Ten Toes down right. That's all.", 
"First I would like to say I have been here since September 2005.  I had the honor and privilege to know many prison guards and staff.  I want to thank all of them.  I would like for everyone to write the people on death row as they are all good men and I am very happy I got to know them.  All of their lives are worth knowing about.\n\nSecondly on February 14th the medical examiner and the chief nurse were engaged in numerous false illegal acts.  They tried to cover up that thousands were wrongfully convicted by Matt Powell, district attorney.  This needs to be brought to justice.\n\nI call upon the FBI to investigate Matt Powell and the Lubbock County Medical Examiner.  Lastly, I was born and raised Catholic and it was not lost upon me that this is Holy Week and last Sunday was Palm Sunday.  Yesterday was my birthday.  Today is the day I join my God and father.  The state may have my body but not my soul.\n\nIn order to save my brothers on death row I call upon Pope Francis and all the people of the world.\n\nLastly, I want everyone to boycott every single business  in the state of Texas until all the businesses are pressed to stop the death penalty.\n\nWith that Lord I commend my spirit.\n\nWarden I am ready to join my father.", 
"No, Well, Hi Mary Jean. See y'all later. Go ahead please.")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • Although some {readr} functions like parse_guess() or type_convert() have already been suggested in the comments, they only work on columns of type character correctly, which is why they can't correct columns that should be numeric instead of double.

    One trick is to read in all columns as character with the argument col_types = cols(.default = "c"). Then we can pipe the result into type_convert() with guess_integer set TRUE.

    Don't know why {readr} uses a different process to guess the columns internally when reading the csv - doesn't make much sense to me.

    library(readr)
    library(dplyr)
    
    tx_deathrow <- read_csv('https://archive.org/download/tx_deathrow_full/tx_deathrow_full.csv',
                            col_types = cols(.default = "c")) %>% 
      type_convert(guess_integer = TRUE)
    #> 
    #> ── Column specification ────────────────────────────────────────────────────────
    #> cols(
    #>   Execution = col_integer(),
    #>   `Date of Birth` = col_date(format = ""),
    #>   `Date of Offence` = col_date(format = ""),
    #>   `Highest Education Level` = col_double(),
    #>   `Last Name` = col_character(),
    #>   `First Name` = col_character(),
    #>   `TDCJ
    #> Number` = col_integer(),
    #>   `Age at Execution` = col_integer(),
    #>   `Date Received` = col_date(format = ""),
    #>   `Execution Date` = col_date(format = ""),
    #>   Race = col_character(),
    #>   County = col_character(),
    #>   `Eye Color` = col_character(),
    #>   Weight = col_integer(),
    #>   Height = col_character(),
    #>   `Native County` = col_character(),
    #>   `Native State` = col_character(),
    #>   `Last Statement` = col_character()
    #> )
    
    tx_deathrow %>% 
      glimpse()
    
    #> Rows: 553
    #> Columns: 18
    #> $ Execution                 <int> 553, 552, 551, 550, 549, 548, 547, 546, 545,…
    #> $ `Date of Birth`           <date> 1983-09-24, 1951-08-28, 1981-04-03, 1987-04…
    #> $ `Date of Offence`         <date> 2004-11-21, 1979-05-27, 2003-12-03, 2008-04…
    #> $ `Highest Education Level` <dbl> 9, 12, 10, 11, 12, 12, 12, 12, 11, 8, 10, 9,…
    #> $ `Last Name`               <chr> "Young", "Bible", "Castillo", "Davila", "Rod…
    #> $ `First Name`              <chr> "Christopher Anthony", "Danny Paul", "Juan E…
    #> $ `TDCJ\nNumber`            <int> 999508, 999455, 999502, 999545, 999534, 9994…
    #> $ `Age at Execution`        <int> 34, 66, 37, 31, 38, 62, 64, 55, 47, 38, 46, …
    #> $ `Date Received`           <date> 2006-03-31, 2003-07-17, 2005-09-27, 2009-02…
    #> $ `Execution Date`          <date> 2018-07-17, 2018-06-27, 2018-05-16, 2018-04…
    #> $ Race                      <chr> "Black", "White", "Hispanic", "Black", "Hisp…
    #> $ County                    <chr> "Bexar", "Harris", "Bexar", "Tarrant", "Lubb…
    #> $ `Eye Color`               <chr> "Brown", "Blue", "Brown", "Brown", "Brown", …
    #> $ Weight                    <int> 216, 194, 180, 161, 198, 188, 179, 198, 204,…
    #> $ Height                    <chr> "6' 1\"", "5' 7\"", "5' 11\"", "5' 11\"", "5…
    #> $ `Native County`           <chr> "Bexar", "Brazoria", "Bexar", "Tarrant", "Wi…
    #> $ `Native State`            <chr> "Texas", "Texas", "Texas", "Texas", "Texas",…
    #> $ `Last Statement`          <chr> "l want to make sure the Patel family knows …
    

    Another way of reading in the data is data.table::fread() which really does a nice job in converting correctly to numeric if appropriate. However, in this case dates are converted to IDate an integer based date format which is non standard. Further, the usual standard output is a data.table but we can get back a data.frame by setting data.table = FALSE. I unfortunately don't see a way to prevent dates from being converted to IDate.

    library(data.table)
    
    tx_deathrow <- data.table::fread('https://archive.org/download/tx_deathrow_full/tx_deathrow_full.csv',
                                     data.table = FALSE)
    
    str(tx_deathrow)
    #> 'data.frame':    553 obs. of  18 variables:
    #>  $ Execution              : int  553 552 551 550 549 548 547 546 545 544 ...
    #>  $ Date of Birth          : IDate, format: "1983-09-24" "1951-08-28" ...
    #>  $ Date of Offence        : IDate, format: "2004-11-21" "1979-05-27" ...
    #>  $ Highest Education Level: num  9 12 10 11 12 12 12 12 11 8 ...
    #>  $ Last Name              : chr  "Young" "Bible" "Castillo" "Davila" ...
    #>  $ First Name             : chr  "Christopher Anthony" "Danny Paul" "Juan Edward" "Erick Daniel" ...
    #>  $ TDCJ
    #> Number           : int  999508 999455 999502 999545 999534 999412 999371 999488 999275 999411 ...
    #>  $ Age at Execution       : int  34 66 37 31 38 62 64 55 47 38 ...
    #>  $ Date Received          : IDate, format: "2006-03-31" "2003-07-17" ...
    #>  $ Execution Date         : IDate, format: "2018-07-17" "2018-06-27" ...
    #>  $ Race                   : chr  "Black" "White" "Hispanic" "Black" ...
    #>  $ County                 : chr  "Bexar" "Harris" "Bexar" "Tarrant" ...
    #>  $ Eye Color              : chr  "Brown" "Blue" "Brown" "Brown" ...
    #>  $ Weight                 : int  216 194 180 161 198 188 179 198 204 160 ...
    #>  $ Height                 : chr  "6' 1\"\"" "5' 7\"\"" "5' 11\"\"" "5' 11\"\"" ...
    #>  $ Native County          : chr  "Bexar" "Brazoria" "Bexar" "Tarrant" ...
    #>  $ Native State           : chr  "Texas" "Texas" "Texas" "Texas" ...
    #>  $ Last Statement         : chr  "l want to make sure the Patel family knows I love them like they love me. Make sure the kids in the world know "| __truncated__ "" "To everyone that has been there for me you know who you are. Love y'all. See y'all on the other side.That's it." "Yes, I would like to say nephew it burns huh. You know I might have lost the fight but I'm still a soldier. I s"| __truncated__ ...
    

    Created on 2024-03-16 with reprex v2.0.2