
Temperature checks using scale and standard deviation in R


I have data like this, except there are many more units with different Names.

structure(list(Date = structure(c(1585551600, 1585555200, 1585558800, 
1585562400, 1585566000, 1585569600, 1585573200, 1585576800, 1585580400, 
1585584000, 1585587600, 1585591200, 1585594800, 1585598400, 1585602000, 
1585605600, 1585609200, 1585612800, 1585616400, 1585620000), class = c("POSIXct", 
"POSIXt"), tzone = ""), Name = c("Suwannee 11.5", "Suwannee 11.5", 
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
"Suwannee 11.5", "Suwannee 11.5"), Temp = c(23.7, 23.6, 23.6, 
23.6, 23.6, 23.5, 23.5, 23.5, 23.4, 23.4, 23.3, 23.3, 23.3, 23.4, 
33.8, 37, 40.6, 31.4, 27.8, 30.2), Data.scaled = c(2.0065971204521, 
1.96308734902769, 1.96308734902769, 1.96308734902769, 1.96308734902769, 
1.91957757760328, 1.91957757760328, 1.91957757760328, 1.87606780617886, 
1.87606780617886, 1.83255803475445, 1.83255803475445, 1.83255803475445, 
1.87606780617886, 6.40108403431786, 7.79339671989909, 9.35974849117797, 
5.35684952013194, 3.79049774885305, 4.83473226303898), deviation_greater_than_2sd = c(FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
)), row.names = 1401:1420, class = "data.frame")
                    Date          Name Temp Data.scaled deviation_greater_than_2sd
1401 2020-03-30 11:00:00 Suwannee 11.5 23.4       1.876                      FALSE
1402 2020-03-30 12:00:00 Suwannee 11.5 23.4       1.876                      FALSE
1403 2020-03-30 13:00:00 Suwannee 11.5 23.3       1.833                      FALSE
1404 2020-03-30 14:00:00 Suwannee 11.5 23.3       1.833                      FALSE
1405 2020-03-30 15:00:00 Suwannee 11.5 23.3       1.833                      FALSE
1406 2020-03-30 16:00:00 Suwannee 11.5 23.4       1.876                      FALSE
1407 2020-03-28 23:00:00 Suwannee 11.5 23.8       2.050                      FALSE
1408 2020-03-29 20:00:00 Suwannee 11.5 23.8       2.050                      FALSE
1409 2020-03-29 21:00:00 Suwannee 11.5 23.9       2.094                      FALSE
1410 2020-03-29 22:00:00 Suwannee 11.5 23.9       2.094                      FALSE
1411 2020-03-30 00:00:00 Suwannee 11.5 23.9       2.094                      FALSE
1412 2020-03-30 01:00:00 Suwannee 11.5 23.9       2.094                      FALSE
1413 2020-03-30 02:00:00 Suwannee 11.5 23.8       2.050                      FALSE
1414 2020-03-29 23:00:00 Suwannee 11.5 24.0       2.137                      FALSE
1415 2020-03-30 17:00:00 Suwannee 11.5 33.8       6.401                       TRUE
1416 2020-03-30 18:00:00 Suwannee 11.5 37.0       7.793                       TRUE
1417 2020-03-30 19:00:00 Suwannee 11.5 40.6       9.360                       TRUE
1418 2020-03-30 20:00:00 Suwannee 11.5 31.4       5.357                       TRUE
1419 2020-03-30 21:00:00 Suwannee 11.5 27.8       3.790                       TRUE
1420 2020-03-30 22:00:00 Suwannee 11.5 30.2       4.835                       TRUE

What I am trying to do is identify when the sensors were pulled out of the water, causing temperature spikes.

library(dplyr)

temp.test <- subset(temp, Name == "Suwannee 11.5")

temp.test <- temp.test[,c("Date", "Name", "Temp")]

temp.test <- temp.test %>%
  mutate(Data.scaled = as.numeric(scale(temp.test$Temp)),
    deviation_greater_than_2sd = Data.scaled >= 2.05)

I am not sure how to apply temp.test <- temp.test %>% mutate(Data.scaled = as.numeric(scale(temp.test$Temp)), deviation_greater_than_2sd = Data.scaled >= 2.05) to each Name in the data separately, but all in a single run, so I don't have to subset each Name out first.

If I don't do this, the code runs just fine on the full dataset, but then it is looking for deviations from all the data combined, and since certain sites have different typical temperatures, I'm afraid it will miss anomalies.

Notice that when I run it on the data with the other names included, it misses many of the "Suwannee 11.5" anomalies.

                     Date          Name Temp Data.scaled deviation_greater_than_2sd
37275 2020-11-23 01:00:00  Clammers Cut 21.3     -0.4578                      FALSE
37276 2020-11-23 02:00:00  Clammers Cut 21.2     -0.4752                      FALSE
37277 2020-11-23 03:00:00  Clammers Cut 21.3     -0.4578                      FALSE
37278 2020-11-23 04:00:00  Clammers Cut 21.7     -0.3882                      FALSE
37279 2020-11-23 05:00:00  Clammers Cut 21.6     -0.4056                      FALSE
37280 2020-11-23 06:00:00  Clammers Cut 21.4     -0.4404                      FALSE
37281 2020-11-23 07:00:00  Clammers Cut 21.1     -0.4925                      FALSE
37282 2020-11-23 08:00:00  Clammers Cut 21.0     -0.5099                      FALSE
37283 2020-11-23 09:00:00  Clammers Cut 21.0     -0.5099                      FALSE
37284 2020-11-23 10:00:00  Clammers Cut 21.0     -0.5099                      FALSE
37285 2020-11-23 11:00:00  Clammers Cut 20.7     -0.5621                      FALSE
37286 2020-11-23 12:00:00  Clammers Cut 20.6     -0.5795                      FALSE
37287 2020-11-23 13:00:00  Clammers Cut 20.5     -0.5969                      FALSE
37288 2020-11-23 14:00:00  Clammers Cut 20.5     -0.5969                      FALSE
37289 2020-11-23 15:00:00  Clammers Cut 20.6     -0.5795                      FALSE
37290 2020-11-23 16:00:00  Clammers Cut 21.0     -0.5099                      FALSE
37291 2020-11-23 17:00:00  Clammers Cut 21.5     -0.4230                      FALSE
37292 2020-11-23 18:00:00  Clammers Cut 22.2     -0.3013                      FALSE
37293 2020-03-30 18:00:00 Suwannee 11.5 37.0      2.2723                       TRUE
37294 2020-03-30 19:00:00 Suwannee 11.5 40.6      2.8983                       TRUE

I was thinking maybe some sort of apply function, but I am very new to using the apply functions.


Solution

  • Since your test data only contains one unique Name, I took the liberty of varying the names to make the function clearer.

    temp <- structure(list(Date = structure(c(1585551600, 1585555200, 1585558800, 
    1585562400, 1585566000, 1585569600, 1585573200, 1585576800, 1585580400, 
    1585584000, 1585587600, 1585591200, 1585594800, 1585598400, 1585602000, 
    1585605600, 1585609200, 1585612800, 1585616400, 1585620000), class = c("POSIXct", 
    "POSIXt"), tzone = ""), Name = c("Suwannee 11.5", "Suwannee 11.5", 
    "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
    "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
    "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
    "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", 
    "Suwannee 11.5", "Suwannee 11.5"), Temp = c(23.7, 23.6, 23.6, 
    23.6, 23.6, 23.5, 23.5, 23.5, 23.4, 23.4, 23.3, 23.3, 23.3, 23.4, 
    33.8, 37, 40.6, 31.4, 27.8, 30.2), Data.scaled = c(2.0065971204521, 
    1.96308734902769, 1.96308734902769, 1.96308734902769, 1.96308734902769, 
    1.91957757760328, 1.91957757760328, 1.91957757760328, 1.87606780617886, 
    1.87606780617886, 1.83255803475445, 1.83255803475445, 1.83255803475445, 
    1.87606780617886, 6.40108403431786, 7.79339671989909, 9.35974849117797, 
    5.35684952013194, 3.79049774885305, 4.83473226303898), deviation_greater_than_2sd = c(FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
    )), row.names = 1401:1420, class = "data.frame")
    set.seed(5555) # for reproducibility
    temp$Name <- sample(c("A","B","C"),NROW(temp),replace = TRUE)
    

    You want to apply your spike detection to all the data sharing the same Name separately. Therefore, use the split function:

    nameDataSplits <- split(temp,temp$Name)
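
    (Optional) If you want to check what the split produced, for example:

    names(nameDataSplits)         # one list element per Name
    sapply(nameDataSplits, nrow)  # number of rows per Name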
    

    Now nameDataSplits is a list of data frames, and within each data frame the Name is the same. To apply your spike detection to every data frame in the list, put it in a function like

    addSpikes <- function(subdf) {
      temp.test <- subdf
      temp.test <- temp.test[,c("Date", "Name", "Temp")]
      temp.test <- temp.test %>%
        mutate(Data.scaled = as.numeric(scale(temp.test$Temp)),
               deviation_greater_than_2sd = Data.scaled >= 2.05)
      return(temp.test)
    }
    

    This function is just copy-pasted from your question; feel free to optimize it.
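
    For example, an equivalent but slightly tidier version of the same function (just a sketch; it only drops the repeated temp.test assignments and assumes dplyr is loaded for %>% and mutate) could look like:

    addSpikes <- function(subdf) {
      # keep only the relevant columns, then scale Temp within this subset
      subdf[, c("Date", "Name", "Temp")] %>%
        mutate(Data.scaled = as.numeric(scale(Temp)),
               deviation_greater_than_2sd = Data.scaled >= 2.05)
    }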

    Now you can add the spikes to your data via an lapply:

    spikesAdded <- lapply(nameDataSplits, addSpikes)
    

    To transform the list back to a dataframe, use

    spikesAddedDF <- do.call("rbind",spikesAdded)
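
    As a side note, if you prefer to stay within dplyr, the same per-Name scaling can also be done without split/lapply by grouping. A sketch of that alternative (again assuming dplyr is loaded):

    spikesAddedDF2 <- temp %>%
      select(Date, Name, Temp) %>%
      group_by(Name) %>%                     # scale() is now applied per Name
      mutate(Data.scaled = as.numeric(scale(Temp)),
             deviation_greater_than_2sd = Data.scaled >= 2.05) %>%
      ungroup()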
    

    Update for different thresholds per Name:

    You can put your desired thresholds in a named vector (the rnorm(3) below is just a placeholder for your real values):

    yourThreshs <- setNames(rnorm(3),c("A","B","C"))
    

    Make sure that the names of yourThreshs are exactly those in temp$Name. Then you can modify the addSpikes function with a second argument called e.g. thresh:

    addSpikes <- function(subdf, thresh) {
      temp.test <- subdf
      temp.test <- temp.test[,c("Date", "Name", "Temp")]
      temp.test <- temp.test %>%
        mutate(Data.scaled = as.numeric(scale(temp.test$Temp)),
               deviation_greater_than_2sd = Data.scaled >= thresh)
      return(temp.test)
    }
    

    Then do

    spikesAdded <- lapply(names(nameDataSplits), function(nam) {
      addSpikes(subdf = nameDataSplits[[nam]],
                thresh = yourThreshs[[nam]])
    })
    spikesAddedDF <- do.call("rbind",spikesAdded)
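
    Equivalently, since nameDataSplits and yourThreshs can be matched by name, Map() does the same thing in one call (a sketch):

    spikesAdded <- Map(addSpikes,
                       nameDataSplits,
                       yourThreshs[names(nameDataSplits)])  # align thresholds with the split order
    spikesAddedDF <- do.call("rbind", spikesAdded)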
    

    Update for a quantile-based threshold:

    If you want to calculate the threshold from the data itself, e.g. as a quantile, you could define the function like this:

    addSpikes <- function(subdf, quantilePercentage = 0.9) {
      temp.test <- subdf
      temp.test <- temp.test[,c("Date", "Name", "Temp")]
      temp.test$Data.scaled <- as.numeric(scale(temp.test$Temp))
      quantileThreshold <- quantile(temp.test$Data.scaled, quantilePercentage)
      temp.test$deviation_greater_than_2sd <- temp.test$Data.scaled >= quantileThreshold
      return(temp.test)
    }
    

    Now you can calculate the threshold as a quantile and play around to see which value fits best. You can proceed as above with:

    spikesAdded <- lapply(nameDataSplits, addSpikes)
    spikesAddedDF <- do.call("rbind",spikesAdded)
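
    To try a different quantile, you can pass it through lapply, e.g.:

    spikesAdded <- lapply(nameDataSplits, addSpikes, quantilePercentage = 0.95)
    spikesAddedDF <- do.call("rbind", spikesAdded)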
    

    Moreover, if you want to do fancier things, you could pass a threshold-generating function to addSpikes as an argument:

    myThresholdGeneratingFunction <- function(x) {
      ## some code that takes a vector of numerics and calculates a single number,
      ## e.g. quantile(x, 0.9), mean(x), ...
    }
    
    addSpikes <- function(subdf, thresholdGeneratingFunction = myThresholdGeneratingFunction) {
      temp.test <- subdf
      temp.test <- temp.test[,c("Date", "Name", "Temp")]
      temp.test$Data.scaled <- as.numeric(scale(temp.test$Temp))
      threshold <- thresholdGeneratingFunction(temp.test$Data.scaled)
      temp.test$deviation_greater_than_2sd <- temp.test$Data.scaled >= threshold
      return(temp.test)
    }
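
    For example, with a 90th-percentile threshold (a sketch that reuses the names from above):

    myThresholdGeneratingFunction <- function(x) quantile(x, 0.9)

    spikesAdded <- lapply(nameDataSplits, addSpikes)  # uses myThresholdGeneratingFunction as the default
    spikesAddedDF <- do.call("rbind", spikesAdded)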