I am looking to perform data quality checks on numerous system-generated tables. One of the checks is to see whether all values in a character column are only numbers. I want to know the number of columns where this check is true. Using the following table as an example, I would want to identify that two columns (code and age) are character columns that consist of only numeric values.
Table Structure
Column Name | Data Type |
---|---|
name | character |
type | character |
code | character |
member_id | integer |
collection_date | date |
age | character |
height | double |
Table Values
Column Name | Column Values |
---|---|
name | only letters |
type | only letters |
code | only numbers |
member_id | only numbers |
collection_date | only dates |
age | only numbers |
height | only numbers |
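For reference, here is a small reproducible sketch of such a table; the values are made up, but they follow the patterns described above:

```r
# Hypothetical data matching the structure described above
df <- data.frame(
  name            = c("Ann", "Bob", "Cara"),       # only letters
  type            = c("basic", "plus", "basic"),   # only letters
  code            = c("101", "202", "303"),        # only numbers, stored as character
  member_id       = c(1L, 2L, 3L),                 # integer
  collection_date = as.Date(c("2021-01-04", "2021-02-11", "2021-03-25")),
  age             = c("34", "45", "29"),           # only numbers, stored as character
  height          = c(1.72, 1.80, 1.65),           # double
  stringsAsFactors = FALSE
)
```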
I am having trouble working out the logic required to do this. What I have done so far is:

To select only the columns that are character data types:
df %>%
dplyr::select_if(is.character)
To validate that all values in the column are numeric (or NA, which is fine):
sum(varhandle::check.numeric(df$code)) == nrow(df)
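If missing values should always count as passing, one way to make that explicit, regardless of how `check.numeric` happens to treat `NA`, is to drop them before the check (a sketch, reusing the hypothetical `df` above):

```r
# Check only the non-missing values, so NAs never cause the column to fail
non_missing <- df$code[!is.na(df$code)]
all(varhandle::check.numeric(non_missing))
```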
I am hoping to build a function that performs this check across all columns and stores the number of columns where the check is true into a column (i.e. with dplyr::mutate), but I am not sure how to structure this: is it an across, an apply, or something else? Or is there an existing function/package that would perform this task? Any help is appreciated.
We could construct the condition within `select` itself:
library(dplyr)
iris %>%
select(where(~ all(varhandle::check.numeric(.))))
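Applied to the question's example (assuming the table is stored as the hypothetical `df` sketched above), counting the character columns that pass the check might look something like this:

```r
library(dplyr)

df %>%
  select(where(is.character)) %>%                         # keep only character columns
  select(where(~ all(varhandle::check.numeric(.)))) %>%   # keep those whose values are all numeric
  ncol()                                                  # number of such columns; here 2 (code and age)
```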
It is not clear whether these are numeric columns stored with a mismatched type, or columns that have some character elements and thus got converted to class `character`. If it is the former, then add a `type.convert` before the `select` and then get only the numeric columns:
df %>%
type.convert(as.is = TRUE) %>%
select(where(is.numeric))
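For the original goal of counting character columns that hold only numbers, one possible variant (again using the hypothetical `df` from above) is to restrict to the character columns first, let `type.convert` promote the all-digit ones, and count what became numeric:

```r
library(dplyr)

df %>%
  select(where(is.character)) %>%   # start from the character columns only
  type.convert(as.is = TRUE) %>%    # all-digit columns such as code and age become integer
  select(where(is.numeric)) %>%     # name and type stay character, so they are dropped
  ncol()                            # 2
```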