I have data like this except there are a lot more units with different Names.
structure(list(Date = structure(c(1585551600, 1585555200, 1585558800,
1585562400, 1585566000, 1585569600, 1585573200, 1585576800, 1585580400,
1585584000, 1585587600, 1585591200, 1585594800, 1585598400, 1585602000,
1585605600, 1585609200, 1585612800, 1585616400, 1585620000), class = c("POSIXct",
"POSIXt"), tzone = ""), Name = c("Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5"), Temp = c(23.7, 23.6, 23.6,
23.6, 23.6, 23.5, 23.5, 23.5, 23.4, 23.4, 23.3, 23.3, 23.3, 23.4,
33.8, 37, 40.6, 31.4, 27.8, 30.2), Data.scaled = c(2.0065971204521,
1.96308734902769, 1.96308734902769, 1.96308734902769, 1.96308734902769,
1.91957757760328, 1.91957757760328, 1.91957757760328, 1.87606780617886,
1.87606780617886, 1.83255803475445, 1.83255803475445, 1.83255803475445,
1.87606780617886, 6.40108403431786, 7.79339671989909, 9.35974849117797,
5.35684952013194, 3.79049774885305, 4.83473226303898), deviation_greater_than_2sd = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
)), row.names = 1401:1420, class = "data.frame")
Date Name Temp Data.scaled deviation_greater_than_2sd
1401 2020-03-30 11:00:00 Suwannee 11.5 23.4 1.876 FALSE
1402 2020-03-30 12:00:00 Suwannee 11.5 23.4 1.876 FALSE
1403 2020-03-30 13:00:00 Suwannee 11.5 23.3 1.833 FALSE
1404 2020-03-30 14:00:00 Suwannee 11.5 23.3 1.833 FALSE
1405 2020-03-30 15:00:00 Suwannee 11.5 23.3 1.833 FALSE
1406 2020-03-30 16:00:00 Suwannee 11.5 23.4 1.876 FALSE
1407 2020-03-28 23:00:00 Suwannee 11.5 23.8 2.050 FALSE
1408 2020-03-29 20:00:00 Suwannee 11.5 23.8 2.050 FALSE
1409 2020-03-29 21:00:00 Suwannee 11.5 23.9 2.094 FALSE
1410 2020-03-29 22:00:00 Suwannee 11.5 23.9 2.094 FALSE
1411 2020-03-30 00:00:00 Suwannee 11.5 23.9 2.094 FALSE
1412 2020-03-30 01:00:00 Suwannee 11.5 23.9 2.094 FALSE
1413 2020-03-30 02:00:00 Suwannee 11.5 23.8 2.050 FALSE
1414 2020-03-29 23:00:00 Suwannee 11.5 24.0 2.137 FALSE
1415 2020-03-30 17:00:00 Suwannee 11.5 33.8 6.401 TRUE
1416 2020-03-30 18:00:00 Suwannee 11.5 37.0 7.793 TRUE
1417 2020-03-30 19:00:00 Suwannee 11.5 40.6 9.360 TRUE
1418 2020-03-30 20:00:00 Suwannee 11.5 31.4 5.357 TRUE
1419 2020-03-30 21:00:00 Suwannee 11.5 27.8 3.790 TRUE
1420 2020-03-30 22:00:00 Suwannee 11.5 30.2 4.835 TRUE
What I am trying to is identify when there sensor were pulled out of the water causing temperature spikes.
temp.test <- subset(temp, Name == "Suwannee 11.5")
temp.test <- temp.test[,c("Date", "Name", "Temp")]
temp.test <- temp.test %>%
mutate(Data.scaled = as.numeric(scale(temp.test$Temp)),
deviation_greater_than_2sd = Data.scaled >= 2.05)
I am not sure how to just apply temp.test <- temp.test %>% mutate(Data.scaled = as.numeric(scale(temp.test$Temp)), deviation_greater_than_2sd = Data.scaled >= 2.05)
to all of the Names in the data separately but all run together so I don't have to subset each Name out first.
If I don't do this it runs just fine on the full dataset, but it's just looking for deviation from all the data combined and certain sites have different temperatures, so I'm afraid it will miss anomalies.
Notice that when I run it with other names it misses many of the "Suwannee 11.5" anomalies.
Date Name Temp Data.scaled deviation_greater_than_2sd
37275 2020-11-23 01:00:00 Clammers Cut 21.3 -0.4578 FALSE
37276 2020-11-23 02:00:00 Clammers Cut 21.2 -0.4752 FALSE
37277 2020-11-23 03:00:00 Clammers Cut 21.3 -0.4578 FALSE
37278 2020-11-23 04:00:00 Clammers Cut 21.7 -0.3882 FALSE
37279 2020-11-23 05:00:00 Clammers Cut 21.6 -0.4056 FALSE
37280 2020-11-23 06:00:00 Clammers Cut 21.4 -0.4404 FALSE
37281 2020-11-23 07:00:00 Clammers Cut 21.1 -0.4925 FALSE
37282 2020-11-23 08:00:00 Clammers Cut 21.0 -0.5099 FALSE
37283 2020-11-23 09:00:00 Clammers Cut 21.0 -0.5099 FALSE
37284 2020-11-23 10:00:00 Clammers Cut 21.0 -0.5099 FALSE
37285 2020-11-23 11:00:00 Clammers Cut 20.7 -0.5621 FALSE
37286 2020-11-23 12:00:00 Clammers Cut 20.6 -0.5795 FALSE
37287 2020-11-23 13:00:00 Clammers Cut 20.5 -0.5969 FALSE
37288 2020-11-23 14:00:00 Clammers Cut 20.5 -0.5969 FALSE
37289 2020-11-23 15:00:00 Clammers Cut 20.6 -0.5795 FALSE
37290 2020-11-23 16:00:00 Clammers Cut 21.0 -0.5099 FALSE
37291 2020-11-23 17:00:00 Clammers Cut 21.5 -0.4230 FALSE
37292 2020-11-23 18:00:00 Clammers Cut 22.2 -0.3013 FALSE
37293 2020-03-30 18:00:00 Suwannee 11.5 37.0 2.2723 TRUE
37294 2020-03-30 19:00:00 Suwannee 11.5 40.6 2.8983 TRUE
I was thinking maybe some sort of apply
function? but I am very new to using the apply
functions.
Since your testdata only contains one unique name, I just felt free to varaite the names to make my function clear.
temp <- structure(list(Date = structure(c(1585551600, 1585555200, 1585558800,
1585562400, 1585566000, 1585569600, 1585573200, 1585576800, 1585580400,
1585584000, 1585587600, 1585591200, 1585594800, 1585598400, 1585602000,
1585605600, 1585609200, 1585612800, 1585616400, 1585620000), class = c("POSIXct",
"POSIXt"), tzone = ""), Name = c("Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5", "Suwannee 11.5",
"Suwannee 11.5", "Suwannee 11.5"), Temp = c(23.7, 23.6, 23.6,
23.6, 23.6, 23.5, 23.5, 23.5, 23.4, 23.4, 23.3, 23.3, 23.3, 23.4,
33.8, 37, 40.6, 31.4, 27.8, 30.2), Data.scaled = c(2.0065971204521,
1.96308734902769, 1.96308734902769, 1.96308734902769, 1.96308734902769,
1.91957757760328, 1.91957757760328, 1.91957757760328, 1.87606780617886,
1.87606780617886, 1.83255803475445, 1.83255803475445, 1.83255803475445,
1.87606780617886, 6.40108403431786, 7.79339671989909, 9.35974849117797,
5.35684952013194, 3.79049774885305, 4.83473226303898), deviation_greater_than_2sd = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
)), row.names = 1401:1420, class = "data.frame")
set.seed(5555) #for reprocubility
temp$Name <- sample(c("A","B","C"),NROW(temp),replace = TRUE)
You want to apply your spikedetection to all the data sharing the same Name
seperately. Therefore use the split
-function
nameDataSplits <- split(temp,temp$Name)
Now nameDataSplits
is a list containing dataframes. In each dataframe, the Name
is the same.
To apply your spikedetection for every dataframe in the list, put it in a function like
addSpikes <- function(subdf) {
temp.test <- subdf
temp.test <- temp.test[,c("Date", "Name", "Temp")]
temp.test <- temp.test %>%
mutate(Data.scaled = as.numeric(scale(temp.test$Temp)),
deviation_greater_than_2sd = Data.scaled >= 2.05)
return(temp.test)
}
This function is just copy-pasted from you, feel free to optimize it.
Now you can add the spikes to your data via an lapply
:
spikesAdded <- lapply(nameDataSplits, addSpikes)
To transform the list back to a dataframe, use
spikesAddedDF <- do.call("rbind",spikesAdded)
Update for different thresholds per Name:
You can put your desired Thresholds in a named vector
yourThreshs <- setNames(rnorm(3),c("A","B","C"))
Make sure that the names of yourThreshs
are exactly those in temp$Name
.
Then you can modify the addSpikes
-function with a second agument called e.g. thresh
:
addSpikes <- function(subdf, thresh) {
temp.test <- subdf
temp.test <- temp.test[,c("Date", "Name", "Temp")]
temp.test <- temp.test %>%
mutate(Data.scaled = as.numeric(scale(temp.test$Temp)),
deviation_greater_than_2sd = Data.scaled >= thresh)
return(temp.test)
}
Then do
spikesAdded <- lapply(names(nameDataSplits), function(nam) {
addSpikes(subdf = nameDataSplits[[nam]],
thresh = yourThreshs[[nam]])
})
spikesAddedDF <- do.call("rbind",spikesAdded)
Update for quantile-threshold
If you want to calculate the threshold depending on the data, e.g. with a quantile, you could define the function like
addSpikes <- function(subdf, quantilePercentage = 0.9) {
temp.test <- subdf
temp.test <- temp.test[,c("Date", "Name", "Temp")]
temp.test$Data.scaled <- as.numeric(scale(temp.test$Temp))
quantileThreshold <- quantile(temp.test$Data.scaled, quantilePercentage)
temp.test$deviation_greater_than_2sd <- temp.test$Data.scaled >= quantileThreshold
return(temp.test)
}
Now you can calculate the threshold as a quantile and play around which fits best. You can proceed like above with:
spikesAdded <- lapply(nameDataSplits, addSpikes)
spikesAddedDF <- do.call("rbind",spikesAdded)
Moreover, if you want to do more fancy stuff, you could give the addSpikes
-function a function as argument:
myThreholdGeneratingFunction <- function(x) {
##some code that takes a vector of numerics and calculates a single number,
#e.g. quantile(x,0.9), mean(x),...
}
addSpikes <- function(subdf, thresholdGeneratingFunction = myThreholdGeneratingFunction) {
temp.test <- subdf
temp.test <- temp.test[,c("Date", "Name", "Temp")]
temp.test$Data.scaled <- as.numeric(scale(temp.test$Temp))
quantileThreshold <- myThreholdGeneratingFunction(temp.test$Data.scaled)
temp.test$deviation_greater_than_2sd <- temp.test$Data.scaled >= quantileThreshold
return(temp.test)
}