Search code examples
rdata-quality

Data quality - check if all values in a character column are numbers in R


I am looking to perform data quality on numerous system generated tables. One of the checks is to see if all values in a character column are only numbers. I am looking to know the number columns where this check is true. Using the following table as an example I would want to identify that two columns (code and age) are character columns that consist of only numeric values.

Table Structure

Column Name Data Type
name character
type character
code character
member_id integer
collection_date date
age character
height double

Table Values

Column Name Column Values
name only letters
type only letters
code only numbers
member_id only numbers
collection_date only dates
age only numbers
height only numbers

I am having issues thinking of the logic that is required to do this. What I have done thus far is:

To select only columns that are character data types

df %>%
  dplyr::select_if(is.character)

To validate that all values in the column are numeric (or null, which is fine)

sum(varhandle::check.numeric(df$code)) == nrow(df)

I am hoping to build a function that performs this across all columns and stores the number where the check is true into a column (i.e. dplyr::mutate), but I am not sure how to structure this; is it an across, an apply, or something else. Or is there any existing function/package that would perform this task? Any help is appreciated.


Solution

  • We could construct the condition within select itself

    library(dplyr)
    iris %>%
       select(where(~ all(varhandle::check.numeric(.)))) 
    

    It is not clear whether numeric columns with mismatched type or columns having some character elements and thus got converted to class character. If it is the former, then add a type.convert before the select and then get only the numeric columns

    df %>%
       type.convert(as.is = TRUE) %>%
       select(where(is.numeric))