I have an airline dataset from stat computing which I am trying to analyse.
It has the variables DepTime and ArrDelay (departure time and arrival delay). I am trying to analyse how arrival delay varies across chunks of departure time. My objective is to find which time chunks a person should avoid when booking tickets in order to avoid arrival delays.
My understanding: if a one-tailed t-test between arrival delays for DepTime > 1800 and arrival delays for DepTime > 1900 shows high significance, then one should avoid flights between 1800 and 1900. (Please correct me if I am wrong.) I want to run such tests for all departure hours.
I am totally new to programming and data science. Any help would be much appreciated.
Data looks like this. The highlighted columns are the ones I am analysing
Sharing an image of the data is not the same as providing the data for us to work with...
That said, I went and grabbed one year of data and worked this up:
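# Read one year of data and keep only the two columns of interest.
flights <- read.csv("~/Downloads/1995.csv", header = TRUE)
flights <- flights[, c("DepTime", "ArrDelay")]
# DepTime is in hhmm form; subtracting 30 before rounding to the nearest 100
# bins each time into its departure hour (e.g. 1845 -> 1800).
flights$Dep <- round(flights$DepTime - 30, digits = -2)
head(flights, n = 25)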
# This tests each hour of departures against the entire day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the day as a whole.
pVsDay <- tapply(flights$ArrDelay, flights$Dep,
                 function(x) t.test(x, flights$ArrDelay, alternative = "less"))
# This tests each hour of departures against every hour of the day
# (including itself, which is an uninformative comparison).
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the other hours.
pAllvsAll <- tapply(flights$ArrDelay, flights$Dep,
                    function(x) tapply(flights$ArrDelay, flights$Dep,
                                       function(z) t.test(x, z, alternative = "less")))
I'll let you figure out multiple hypothesis testing and the like.
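As a starting point for that, here is a rough sketch of how the per-hour p-values could be pulled out and corrected for multiple testing with base R's p.adjust. Everything about the data here is an assumption: the flights data frame is synthetic (made-up delays, with evening hours deliberately shifted later), the choice of the Holm method and the 0.05 cutoff are only illustrative, and floor() is used for the hourly binning instead of the rounding trick above.

```r
# Synthetic stand-in for the airline data (made-up numbers, not real flights).
set.seed(1)
n <- 5000
hour <- sample(6:22, n, replace = TRUE)
minute <- sample(0:59, n, replace = TRUE)
flights <- data.frame(
  DepTime  = hour * 100 + minute,
  # Evening departures are given a larger average delay on purpose.
  ArrDelay = rnorm(n, mean = ifelse(hour >= 17, 15, 5), sd = 20)
)

# Bin each hhmm departure time into its hour (e.g. 1845 -> 1800).
flights$Dep <- floor(flights$DepTime / 100) * 100

# Same test as each hour vs. the whole day, but keeping only the p-value.
byHour <- split(flights$ArrDelay, flights$Dep)
pRaw <- sapply(byHour, function(x)
  t.test(x, flights$ArrDelay, alternative = "less")$p.value)

# Adjust for the fact that one test is run per departure hour.
pAdj <- p.adjust(pRaw, method = "holm")

# Hours whose delays are significantly lower than the day as a whole.
lowDelayHours <- names(pAdj)[pAdj < 0.05]
print(round(pAdj, 4))
print(lowDelayHours)
```

With the real data, the same steps would apply to the flights data frame read from the CSV; method = "BH" is an alternative if controlling the false discovery rate is preferred over the family-wise error rate.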