Tags: r, join, merge, interpolation, imputets

R: Why is merge dropping data? How to interpolate missing values for a merge


I am trying to merge two relatively large datasets. I am merging by SiteID, a unique indicator of location, and by date/time, which is composed of Year, Month (Mo), Day, and Hour (Hr).

The problem is that the merge is dropping data somewhere. Minimum, maximum, mean, and median values all change, when they should be the same data, simply merged. I have converted the key columns to characters and checked that the character strings match, yet I still lose data. I have tried left_join as well, but that doesn't seem to help. See below for more details.

EDIT: Merge is dropping data because data do not exist for every combination of ("SiteID", "Year", "Mo", "Day", "Hr"). So, I needed to interpolate missing values in dB before I could merge (see answer below). END EDIT

See the link at the bottom of the page to reproduce this example.

PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)

dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
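
For reference, the same conversion can be written in one step with dplyr (a sketch; assumes dplyr 1.0+ for across()):

library(dplyr)
key_cols <- c("SiteID", "Year", "Mo", "Day", "Hr")   # the five merge keys
PC17 <- mutate(PC17, across(all_of(key_cols), as.character))
dB   <- mutate(dB, across(all_of(key_cols), as.character))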

# confirm that data are stored as characters
str(PC17)
str(dB)

Now to compare my SiteID values, I use unique to see what character strings I have, and setdiff to see whether any values appear in one dataset but not the other. One SiteID is missing from each, but this is okay, because it is truly missing in the data (not a character string issue).

sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))

setdiff(PC17$SiteID, dB$SiteID)  ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID)  ## FI7D is the only one missing, this is ok

Now when I look at the data (summarized by SiteID), it looks like a nice, full dataframe, meaning I have data for every site that I should have.

library(dplyr)
dB %>% 
  group_by(SiteID) %>% 
  summarise(
    min_dBL50=min(dbAL050, na.rm=TRUE),
    max_dBL50=max(dbAL050, na.rm=TRUE),
    mean_dBL50=mean(dbAL050, na.rm=TRUE),
    med_dBL50=median(dbAL050, na.rm=TRUE)
  )

# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        35.3      57.3       47.0      47.6
 2 CU1M        33.7      66.8       58.6      60.8
 3 CU1U        31.4      55.9       43.1      43.3
 4 CU2D        40        58.3       45.3      45.2
 5 CU2M        32.4      55.8       41.6      41.3
 6 CU2U        31.4      58.1       43.9      42.6
 7 CU3D        40.6      59.5       48.4      48.5
 8 CU3M        35.8      75.5       65.9      69.3
 9 CU3U        40.9      59.2       46.6      46.2
10 CU4D        36.6      49.1       43.6      43.4
# ... with 49 more rows

Here, I merge the two datasets PC17 and dB by "SiteID", "Year", "Mo", "Day", "Hr", keeping all PC17 rows even if they have no matching dB values (all.x=TRUE).

However, when I look at the summary of this data, all of the SiteID now have different values, and some sites, such as "CU3D" and "CU4D", are missing completely.

PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))

PCdB %>% 
  group_by(SiteID) %>% 
  summarise(
    min_dBL50=min(dbAL050, na.rm=TRUE),
    max_dBL50=max(dbAL050, na.rm=TRUE),
    mean_dBL50=mean(dbAL050, na.rm=TRUE),
    med_dBL50=median(dbAL050, na.rm=TRUE)
  )

# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        47.2      54         52.3      54  
 2 CU1M        35.4      63         49.2      49.2
 3 CU1U        35.3      35.3       35.3      35.3
 4 CU2D        42.3      42.3       42.3      42.3
 5 CU2M        43.1      43.2       43.1      43.1
 6 CU2U        43.7      43.7       43.7      43.7
 7 CU3D       Inf      -Inf        NaN        NA  
 8 CU3M        44.1      71.2       57.6      57.6
 9 CU3U        45        45         45        45  
10 CU4D       Inf      -Inf        NaN        NA  
# ... with 49 more rows

I set everything to characters with as.character() in the first lines. Additionally, I have checked Year, Day, Mo, and Hr with setdiff and unique just as I did above with SiteID, and there don't appear to be any issues with those character strings not matching.
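
Note, though, that checking each key column on its own cannot catch the real problem here: every individual value can match while particular combinations of the five keys are absent from dB. One way to compare full key combinations (a sketch; do.call(paste, ...) just builds a composite key string per row):

pc_key <- do.call(paste, PC17[c("SiteID", "Year", "Mo", "Day", "Hr")])
db_key <- do.call(paste, dB[c("SiteID", "Year", "Mo", "Day", "Hr")])
length(setdiff(pc_key, db_key))  ## PC17 key combinations with no dB match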

I have also tried the dplyr function left_join to merge the datasets, and it hasn't made a difference.
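
That is expected: left_join(PC17, dB, by = ...) does the same thing as merge(..., all.x=TRUE). A more informative diagnostic is anti_join, which returns the PC17 rows that have no dB match, confirming whether the losses come from missing key combinations rather than mismatched strings (a sketch):

unmatched <- anti_join(PC17, dB, by = c("SiteID", "Year", "Mo", "Day", "Hr"))
nrow(unmatched)           ## PC17 rows with no corresponding dB row
count(unmatched, SiteID)  ## which sites are affected, and how badly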


Solution

  • In the end, I answered this question with a better understanding of the data. The merge function itself was not dropping any values; it was doing exactly what it was told. However, since the datasets were merged by SiteID, Year, Mo, Day, Hr, any PC17 row without a matching dB row got NA for the dB columns, and for a few SiteID every row was unmatched. The Inf, -Inf, and NaN values in the summary are simply what min(), max(), and mean() return when na.rm=TRUE removes every value in a group.

    The reason is that dB is not a fully continuous dataset: it does not have an observation for every combination of the merge variables (SiteID, Year, Mo, Day, Hr), so for some SiteID the two datasets did not overlap at all.

    So I solved this problem with interpolation. That is, I filled in the missing values based on the values from dates on either side of them. The package imputeTS was valuable here.

    So I first interpolated the missing values in between the dates with data, and then I re-merged the datasets.

    library(imputeTS)
    library(tidyverse)
    
    ### We want to interpolate dB values by SiteID in the dB dataset, BEFORE merging. 
    ### Why? Because the merge drops all the data that would help with the interpolation!!
    
    dB<-read.csv("dB.csv")
    
    dB_clean <- dB %>%
      mutate_if(is.integer, as.character)
    
    # Create a wide table with a column for each jDay; missing
    #   days show up as NA's.
    # jDay is an integer date ('julian day'). The NA's in these
    #   columns mark the missing SiteID/jDay rows we need to add.
    dB_NA_find <- dB_clean %>%
      count(SiteID, jDay) %>%
      spread(jDay, n)
    
    dB_NA_find
    # A tibble: 59 x 88
    # SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640` `13641` 
    # <fct>    <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>  
    #  1 CU1D        NA      NA      NA      NA      NA      NA      NA      NA
    #  2 CU1M        NA      11      24      24      24      24      24      24
    #  3 CU1U        NA      11      24      24      24      24      24      24
    #  4 CU2D        NA      NA      NA      NA      NA      NA      NA      NA
    #  5 CU2M        NA       9      24      24      24      24      24      24
    #  6 CU2U        NA       9      24      24      24      24      21      NA
    #  7 CU3D        NA      NA      NA      NA      NA      NA      NA      NA
    #  8 CU3M        NA      NA      NA      NA      NA      NA      NA      NA
    #  9 CU3U        NA      NA      NA      NA      NA      NA      NA      NA
    # 10 CU4D        NA      NA      NA      NA      NA      NA      NA      NA
    
    
    # Take the missing (NA) SiteID/jDay entries and make the desired row for each
    dB_rows_to_add <- dB_NA_find %>%
      gather(jDay, count, 2:88) %>%
      filter(is.na(count)) %>%
      select(-count)
    
    # Add these rows to the original, remove rows whose jDay is NA
    #   (these have been replaced by the new rows), and sort
    dB <- dB_clean %>%
      bind_rows(dB_rows_to_add) %>%
      filter(jDay != "NA") %>%
      arrange(SiteID, jDay)
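
    For reference, tidyr's complete() collapses the count/spread/gather round-trip above into a single call (a sketch of an equivalent, assuming every observed SiteID crossed with every observed jDay is wanted):

    dB <- dB_clean %>%
      complete(SiteID, jDay) %>%    # adds NA rows for missing SiteID/jDay combos
      filter(!is.na(jDay)) %>%
      arrange(SiteID, jDay)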
    
    
    sum(is.na(dB$DailyL50.x))  ## How many NAs do I have?
    # [1] 3030
    
    ## Here is where we do the na.interpolation with package imputeTS
    # prime the for loop with a dummy row of zeros (one "0" per column of dB,
    #   so the rbind below lines up)
    D<-rep("0",17)
    sites<-unique(dB$SiteID)
    
    for(i in 1:length(sites)){
      temp<-dB[dB$SiteID==sites[i], ]   # one site's rows
      temp<-temp[order(temp$jDay),]     # in day order
      temp$DayL50<-na.interpolation(temp$DailyL50.x, option="spline")
      D<-rbind(D, temp)                 # append the interpolated site
    }
    
    # delete the first row of zeros from above 'priming'
    dBN<-D[-1,]
    
    sum(is.na(dBN$DayL50))  ## How many NAs do I have?
    # [1] 0
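
    The loop and its rbind priming can also be written as a grouped mutate, a sketch of an equivalent (note that imputeTS 3.0+ renames na.interpolation() to na_interpolation()):

    dBN <- dB %>%
      group_by(SiteID) %>%
      arrange(jDay, .by_group = TRUE) %>%     # day order within each site
      mutate(DayL50 = na.interpolation(DailyL50.x, option = "spline")) %>%
      ungroup()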
    

    Because I did the above interpolation of NAs based on jDay, I am missing the Month (Mo), Day, and Year information for those rows.

    dBN$Year<-"2017"  #all data are from 2017
    
    ## I could not figure out how jDay was formatted, so I created a manual 'key'
    ## to get Mo and Day by counting from a known date/jDay pair in the original data
    
    #Example:
    # 13635 is Mo=5 Day=1
    # 13665 is Mo=5 Day=31
    # 13666 is Mo=6 Day=1
    # 13695 is Mo=6 Day=30
    
    key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
    key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
    key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
    key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)
    
    #make master 'key'
    key<-rbind(key4,key5,key6,key7)
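
    As an aside, the pairs above suggest jDay is a serial day count with origin 1980-01-01: as.Date(13635, origin = "1980-01-01") returns "2017-05-01". If that inference holds for your data (verify it first), the manual key can be built in one step:

    jdates  <- as.Date(13633:13719, origin = "1980-01-01")
    key_alt <- data.frame(jDay = 13633:13719,
                          Day  = as.integer(format(jdates, "%d")),
                          Mo   = as.integer(format(jdates, "%m")))
    all(key_alt$Day == key$Day & key_alt$Mo == key$Mo)  ## TRUE if the origin is right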
    
    # Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
    dBM<-merge(dBN, key, by="jDay", all.x=TRUE)
    
    #clean unnecessary columns and rename 'Mo' and 'Day' so they match the PC17 dataset
    dBM<-dBM[ , -c(2,3,6:16)]
    colnames(dBM)[5:6]<-c("Day","Mo")
    
    #I noticed an issue with duplication - merging with PC17 created a massive dataframe
    dBM %>%  ### too many observations per day; a merge would duplicate them out of control
      count(SiteID, jDay, DayL50) %>% 
      summarise(
        min=min(n, na.rm=TRUE),
        mean=mean(n, na.rm=TRUE),
        max=max(n, na.rm=TRUE)
      )
    
    ## to fix this, I kept only distinct observations so that each day has 1 observation
    dB<-distinct(dBM, .keep_all = TRUE)
    ### Now run the count/summarise check above again; each day should be down to 1 observation
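
    A note on distinct(): with no columns listed, it only drops rows that are identical across every column. If duplicates differ in some leftover column, listing the keys explicitly is stricter (this assumes one row per site per day is the granularity wanted):

    dB <- distinct(dBM, SiteID, Year, Mo, Day, .keep_all = TRUE)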
    

    Now when you do the merge with dB and PC17, the interpolated values (which were NAs before) are included. Note that the merge keys no longer include Hr, since dB is now a daily series. It will look something like this:

    > PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day"), all.x=TRUE, all=FALSE,no.dups=TRUE))
    > ### all.x=TRUE is important: it keeps all PC17 data, even rows that don't have corresponding dB data.
    
    > library(dplyr)
    
    #Here is the NA interpolated 'dB' dataset 
    > dB %>% 
    +   group_by(SiteID) %>% 
    +   dplyr::summarise(
    +     min_dBL50=min(DayL50, na.rm=TRUE),
    +     max_dBL50=max(DayL50, na.rm=TRUE),
    +     mean_dBL50=mean(DayL50, na.rm=TRUE),
    +     med_dBL50=median(DayL50, na.rm=TRUE)
    +   )
    # A tibble: 59 x 5
       SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
       <chr>      <dbl>     <dbl>      <dbl>     <dbl>
     1 CU1D        44.7      53.1       49.4      50.2
     2 CU1M        37.6      65.2       59.5      62.6
     3 CU1U        35.5      51         43.7      44.8
     4 CU2D        42        52         47.8      49.3
     5 CU2M        38.2      49         43.1      42.9
     6 CU2U        34.1      53.7       46.5      47  
     7 CU3D        46.1      53.3       49.7      49.4
     8 CU3M        44.5      73.5       61.9      68.2
     9 CU3U        42        52.6       47.0      46.8
    10 CU4D        42        45.3       44.0      44.6
    # ... with 49 more rows
    
    # Now here is the PCdB merged dataset, and we are no longer missing values!
    > PCdB %>% 
    +   group_by(SiteID) %>% 
    +   dplyr::summarise(
    +     min_dBL50=min(DayL50, na.rm=TRUE),
    +     max_dBL50=max(DayL50, na.rm=TRUE),
    +     mean_dBL50=mean(DayL50, na.rm=TRUE),
    +     med_dBL50=median(DayL50, na.rm=TRUE)
    +   )
    # A tibble: 60 x 5
       SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
       <chr>      <dbl>     <dbl>      <dbl>     <dbl>
     1 CU1D        44.8      50         46.8      47  
     2 CU1M        59        63.9       62.3      62.9
     3 CU1U        37.9      46         43.6      44.4
     4 CU2D        42.1      51.6       45.6      44.3
     5 CU2M        38.4      48.3       44.2      45.5
     6 CU2U        39.8      50.7       45.7      46.4
     7 CU3D        46.5      49.5       47.7      47.7
     8 CU3M        67.7      71.2       69.5      69.4
     9 CU3U        43.3      52.6       48.1      48.2
    10 CU4D        43.2      45.3       44.4      44.9
    # ... with 50 more rows