I have two data frames I want to merge based on a numeric value, however I am having trouble with floating point accuracy. Example:
> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 <NA> B
3 0.12 b <NA>
4 0.13 c C
5 0.14 d D
6 0.15 <NA> E
7 0.15 e <NA>
8 0.16 f F
9 0.17 g G
10 0.18 h H
11 0.19 i I
12 0.20 j J
Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues as discussed in this post. The solution for finding equality there was the use of the all.equal
function to remove floating point artifacts, however I don't believe there is a way to do this within the merge
function.
Currently I get around it by forcing one of the the number
columns to a character and then back to a number after merge
, but this is a little clunky; does anyone have a better solution for this problem?
> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
EDIT: A little more about the data
ts
) object and a number of functions are called to extract features from the time series (one of which is the time
value), which is returned as a data frame. Meanwhile another set of functions is being called to get other features from the time series as targets. There are also potentially other series getting features generated to complement the original series. These values then have to be reunited using the time
value.POSIXct
: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV type format so it can be passed to other platforms. Storing as a POSIXct
value would be difficult since the series aren't necessarily stored in calendar times.Round to the level of precision that will allow the number to be equal.
> df1$number=round(df1$number,2)
> df2$number=round(df2$number,2)
>
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
If you need to choose the level of precision programmatically then you should tell us more about the data and whether we can perhaps assume that it will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses sqrt(.Machine$double.eps) which in usually practice should be similar to round( ..., 16)
.