In the R language, is there any reliable method of checking if a variable is floating-point or integer-valued?
I have looked at several proposed solutions. The R help file for is.integer(x) suggests using the round(x) function, as such:
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
However, this checks for a whole number, not for a number stored as floating point. The function returns TRUE for values such as 1.0, 2.0, 3.0, ... which have no fractional part but are still written in floating-point notation.
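To make the behaviour concrete, here is a minimal sketch that applies the helper from the help file to a few values. The vector vals is made up for illustration.

```r
# Helper as given in the is.integer() help file.
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol

# Hypothetical test values: two whole-valued doubles and one with a fractional part.
vals <- c(1.0, 2.5, 3)

is.wholenumber(vals)  # TRUE FALSE TRUE
is.integer(vals)      # FALSE: the vector is stored as double
```

The helper answers "is the value whole?", while is.integer() answers "is the storage type integer?". Neither reports how the number was originally written.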
It baffles me that I haven't found a simple way to check this in R, given that it is designed for manipulating large tables of (often heterogeneous) data. I need this because the manipulations I apply to the data are conditional on their type, owing to the potential data ranges across the whole dataset. So, if my algorithm comes across a value such as 1.0, 2.0, 3.0, ..., it should know that a particular manipulation is applicable.
I have tried converting the value to a string using sprintf() and then checking if the string contains a period. This has the obvious issue of formatting the number based on the specifier, hence:
x = 5.0
y = 5
sprintf("%f", x) # "5.000000"
sprintf("%f", y) # "5.000000"
sprintf("%d", x) # "5"
sprintf("%d", y) # "5"
As shown, there is no way to differentiate them.
EDIT: For more detail on my application and why this solution is needed.
I am working with large tables of data, so I am not defining the data myself. Part of my algorithm involves taking a row of data, incrementing/decrementing its values and passing it through a classifier to look for changes in the outcome. For simplicity, I am only considering numeric values. Depending on how the data is formatted, different numerical types can undergo different manipulations. A real-valued feature can be incremented/decremented in fractional steps, e.g. 5.0, 4.8, 4.6, 4.4, ... whereas an integer-valued feature must be manipulated in whole steps, e.g. 5, 4, 3, 2, ...
I want to calculate the increment once I have identified the data type of the feature. Thus, if the feature value is 32, the algorithm will increment/decrement in whole steps. If the feature value is 5.9, the algorithm will increment/decrement in fractional steps.
The problem arises when the algorithm encounters a floating-point value of 5.0. This signifies that the feature is in the real-valued domain, but R treats it as a whole number.
If I cannot find a solution for this, I will have to predefine the data type of each column in each table.
Some sample data from the diabetes dataset: https://pastebin.com/5FTsaC0g
The first thing to say is that in R, there is no difference between 5 and 5.0. As the R Language Definition sets out:
Perhaps unexpectedly, the number returned from the expression 1 is a numeric... We can use the "L" suffix to qualify any number with the intent of making it an explicit integer.
class(5) # numeric
class(5.0) # numeric
class(5L) # integer
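As a quick check of the point above, the following sketch compares the three literals directly. The variable names x, y and z are only for illustration.

```r
x <- 5.0
y <- 5
z <- 5L

identical(x, y)  # TRUE: both are the same double value
typeof(x)        # "double"
typeof(z)        # "integer"
x == z           # TRUE: equal in value, different storage type
identical(x, z)  # FALSE: storage types differ
```

Once the value has been parsed, nothing records whether it was typed as 5 or 5.0, which is why the sprintf() approach in the question cannot tell them apart.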
However, it sounds like this is a case where you have a large-ish dataset that you have not defined, and you need to establish which columns are discrete and which are continuous.
We can use mtcars as an example, as all values are stored as doubles but some are actually discrete (e.g. number of gears, horsepower). Let's convert it to a tibble so it prints column types nicely:
dat <- dplyr::as_tibble(mtcars)
head(dat, n = 2)
# # A tibble: 2 × 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
You can use the base R type.convert() to infer the column types for you:
head(type.convert(dat, as.is = TRUE), n = 2)
# # A tibble: 2 × 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
Or if you just want the column types:
sapply(type.convert(dat, as.is = TRUE), class)
# mpg cyl disp hp drat wt qsec vs am gear carb
# "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric" "integer" "integer" "integer" "integer"
Obviously there is technically a non-zero chance that you could have a continuous variable where all the values happen to be exactly an integer, but I wouldn't worry about this unless you have a tiny number of rows or extremely imprecise measurements.
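To connect this back to the increment/decrement use case in the question, here is a rough sketch that picks a step size per column. It uses a direct whole-value test on each column in place of type.convert(), and it assumes every column is numeric. The helper names is_whole_column and choose_steps, and the fractional step of 0.2, are hypothetical choices for illustration.

```r
# Hypothetical helper: TRUE if every non-missing value in a column is whole.
is_whole_column <- function(col) {
  vals <- col[!is.na(col)]
  length(vals) > 0 && all(vals == round(vals))
}

# Hypothetical helper: one step size per numeric column.
# fractional_step = 0.2 mirrors the 5.0, 4.8, 4.6 example in the question;
# it is an assumption, not a recommended value.
choose_steps <- function(df, fractional_step = 0.2) {
  vapply(df, function(col) if (is_whole_column(col)) 1 else fractional_step, numeric(1))
}

steps <- choose_steps(mtcars)
steps[c("mpg", "cyl", "hp")]
```

If you know a column is continuous even though all of its values happen to be whole, you can override its entry afterwards, e.g. steps["hp"] <- 0.2.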