Tags: r, floating-point, integer, number-formatting

Checking if a variable contains a floating-point value


In the R language, is there any reliable method of checking if a variable is floating-point or integer-valued?

I have looked at several proposed solutions. The R help file for is.integer(x) suggests using the round(x) function, like so:

is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol

However, this checks whether the value is a whole number, not whether it is a floating-point value. The function returns TRUE for values such as 1.0, 2.0, 3.0, ..., which have no fractional part but are still written in floating-point notation.
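
For example, using that definition:

is.wholenumber(1L)   # TRUE
is.wholenumber(1.0)  # TRUE  -- stored as a double, but no fractional part
is.wholenumber(1.5)  # FALSE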

It baffles me that I haven't found a simple solution for this in R, given that the language is designed for manipulating large tables of (often heterogeneous) data. I need this particular solution because I apply manipulations to data that are conditional on their type, due to the potential data ranges across the whole dataset. So, if my algorithm comes across a value such as 1.0, 2.0, 3.0, ..., it should know that a particular manipulation is applicable.

I have tried converting the value to a string using sprintf() and then checking if the string contains a period. This has the obvious issue of formatting the number based on the specifier, hence:

x = 5.0
y = 5
sprintf("%f", x) # "5.000000"
sprintf("%f", y) # "5.000000"
sprintf("%d", x) # "5"
sprintf("%d", y) # "5"

As shown, there is no way to differentiate them.

EDIT: For more detail on my application and why this solution is needed.

I am working with large tables of data, so I am not defining the data myself. Part of my algorithm involves taking a row of data, incrementing/decrementing its values and passing it through a classifier to look for changes in the outcome. For simplicity, I am only considering numeric values. In terms of how the data is formatted, different numerical types can undergo different manipulations. A real-valued feature can be incremented/decremented in fractional steps, e.g. 5.0, 4.8, 4.6, 4.4, ..., whereas an integer-valued feature must be manipulated in whole steps, e.g. 5, 4, 3, 2, ...

I want to calculate the increment when I identify the data type of the feature. Thus, if the feature value is 32, the algorithm will increment/decrement in whole steps. If the feature value is 5.9, the algorithm will increment/decrement in fractional steps.

The problem is when the algorithm encounters a floating-point value of 5.0. This signifies that it is in the real-valued domain, but it is judged as a whole number according to R's functionality.
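
To make this concrete, here is a minimal sketch of the step-size logic described above; pick_step() is a hypothetical helper and the fractional step of 0.2 is just an illustrative choice:

# Hypothetical sketch: pick a step size from the value itself
pick_step <- function(x, tol = .Machine$double.eps^0.5) {
  if (abs(x - round(x)) < tol) {
    1    # looks like a whole number -> step in whole units
  } else {
    0.2  # has a fractional part -> step in fractional units
  }
}

pick_step(32)   # 1
pick_step(5.9)  # 0.2
pick_step(5.0)  # 1 -- but this feature is really continuous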

If I cannot find a solution for this, I will have to predefine the data type of each column in each table.

Some sample data from the diabetes dataset: https://pastebin.com/5FTsaC0g


Solution

  • The first thing to say is that in R, there is no difference between 5 and 5.0. As the R Language Definition sets out:

    Perhaps unexpectedly, the number returned from the expression 1 is a numeric... We can use the "L" suffix to qualify any number with the intent of making it an explicit integer.

    class(5) # numeric
    class(5.0) # numeric
    class(5L) # integer
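
    You can confirm that 5 and 5.0 are literally the same value, while 5L is not:

    identical(5, 5.0) # TRUE
    identical(5, 5L)  # FALSE
    typeof(5.0)       # "double"
    typeof(5L)        # "integer"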
    

    However, it sounds like this is a case where you have a large-ish dataset that you have not defined yourself, and you need to establish which columns are discrete and which are continuous.

    We can use mtcars as an example, since all of its values are stored as doubles but some are actually discrete (e.g. number of gears, horsepower). Let's convert it to a tibble so it prints column types nicely:

    dat <- dplyr::as_tibble(mtcars)
    head(dat, n = 2)
    # # A tibble: 2 × 11
    #     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    # 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
    # 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
    

    You can use the base R type.convert() to infer the column types for you:

    head(type.convert(dat, as.is = TRUE), n = 2)
    # # A tibble: 2 × 11
    #     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #   <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
    # 1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
    # 2    21     6   160   110   3.9  2.88  17.0     0     1     4     4
    

    Or if you just want the column types:

    sapply(type.convert(dat, as.is = TRUE), class)
    #       mpg       cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb 
    # "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric" "integer" "integer" "integer" "integer" 
    

    Obviously there is technically a non-zero chance that you could have a continuous variable where all the values happen to be exactly an integer, but I wouldn't worry about this unless you have a tiny number of rows or extremely imprecise measurements.
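
    If you do want to guard against that, a quick sanity check (assuming no missing values, as in mtcars) is to ask which columns contain at least one fractional value:

    sapply(dat, function(col) any(col %% 1 != 0))
    #   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
    #  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE 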