
How to get mean and standard deviation of daytimes in java


For an Android Studio project written in Java, I've got a List of daytimes which collects hours and minutes as integers like this:

List<Integer> times = new ArrayList<>();
int hour = 16;
int minute = 25;
int time = hour * 60 + minute;
times.add(time);

I need the mean and the standard deviation of times in order to achieve a list of non-outlier times. However, the ordinary mean and standard deviation don't seem to work. Here is what I'm doing right now:

private List<String> getNonOutlierTimes() {

   int mean = convertToTime((times.stream().mapToInt(Integer::intValue).sum()) / times.size());
   int sd = (int) calculateStandardDeviation(mean);
   int maxTime = (int) (mean + 1.5 * sd);
   int minTime = (int) (mean - 1.5 * sd);

   List<Integer> nonOutliers = new ArrayList<>();

   for (int i = 0; i < times.size(); i++) {

       if ((times.get(i) <= maxTime) && (times.get(i) >= minTime)) {
                nonOutliers.add(times.get(i));
       }
   }

   List<String> nonOutliersStr = new ArrayList<>();

   for (Integer nonOutlier : nonOutliers) {
        nonOutliersStr.add(convertIntTimesToStr(nonOutlier));
   }

   return nonOutliersStr;
}


private int convertToTime(int a) {

   if ((a < 24*60) && (a >= 0)) {
            return a;
        } else if (a < 0) {
            return 24*60 + a;
        } else {
            return a % (24*60);
        }

}

private double calculateStandardDeviation(int mean) {

        int sum = 0;
        for (int j = 0; j < times.size(); j++) {
            int time = convertToTime(times.get(j));
            sum = sum + ((time - mean) * (time - mean));
        }
        double squaredDiffMean = (double) (sum) / (times.size());

        return (Math.sqrt(squaredDiffMean));
    }


private String convertIntTimesToStr(int time) {

        String hour = (time / 60) + "";
        int minute = time % 60;
        String minuteStr = minute < 10 ? "0" + minute : "" + minute;

        return hour + ":" + minuteStr;
    }

Although all calculations are based on valid statistics, the calculated mean and sd seem irrelevant. For example when the times list contains the following:

225 (03:45 am), 90 (01:30 am), 0 (12:00 am), 1420 (11:40 pm), 730 (12:10 pm)

I need a non-outliers list containing:

1420 (11:40 pm), 0 (12:00 am), 90 (01:30 am), 225 (03:45 am)

where the actual output is:

0 (12:00 am), 90 (01:30 am), 225 (03:45 am), 730 (12:10 pm)

i.e., I need the mean to be where most of the times are. To be more specific, consider a list of times containing integers 1380 (23:00 or 11:00 pm), 1400 (23:20 or 11:20 pm), and 60 (01:00 am). The mean for these times is 945 (15:45 or 03:45 pm) where I need the mean to lie between 23:00 and 01:00.

I have already found this solution for a list of two times. However, my times.size() is always greater than 2, and I'd also like to calculate the standard deviation. So I'd appreciate your help in this regard.

Thanks in advance.


Solution

  • You are not working with real numbers, but with numbers modulo 1440. Division by a natural number is not well defined in this context; more precisely, n x = a has n solutions for each a. E.g. 3 x = 300 has the solutions 300 / 3, 1740 / 3 and 3180 / 3, i.e. 100, 580 and 1060 (300, 1740 and 3180 are different representations of the same element 300).

    Therefore you cannot talk about the arithmetic mean in the context of times of day. However, the distance between two times of day is well defined: the distance between 21:00 and 23:00 is 2 hours, and so is the distance between 23:00 and 1:00. Hence we can take another definition of "mean":

    • Let's call "mean" the time of day that minimizes the sum of squared distances from the data. The usual mean of real numbers has the same property.
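
The distance used in this definition wraps around midnight. A minimal sketch of it, assuming times are stored as minutes since midnight as in the question (the class and method names are illustrative, not from the original post):

```java
final class TimeOfDayDistance {
    static final int MINUTES_PER_DAY = 24 * 60;

    /** Shortest distance in minutes between two times of day, going either way around the clock. */
    static int distance(int a, int b) {
        int diff = Math.floorMod(a - b, MINUTES_PER_DAY); // always in [0, 1439]
        return Math.min(diff, MINUTES_PER_DAY - diff);
    }

    public static void main(String[] args) {
        System.out.println(distance(21 * 60, 23 * 60)); // 21:00 to 23:00
        System.out.println(distance(23 * 60, 1 * 60));  // 23:00 to 01:00, across midnight
    }
}
```

Both calls give 120 minutes, matching the two-hour distances mentioned above.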

    Fortunately, one can prove that this new mean is one of the n solutions of n x = sum of values. These solutions differ in their sum of squared distances from the data, and we have to choose the one for which it is minimal.
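
As a rough illustration of that statement, the n candidates can simply be enumerated and compared. This is a brute-force sketch in minutes, not the approach used in the code further down; the class and method names are hypothetical, and it assumes a non-empty list (an empty list yields NaN):

```java
import java.util.Arrays;
import java.util.List;

final class CircularMeanSketch {
    static final double DAY = 24 * 60; // minutes per day

    /** Shortest distance between two times of day, wrapping around midnight. */
    static double distance(double a, double b) {
        double diff = Math.abs(a - b) % DAY;
        return Math.min(diff, DAY - diff);
    }

    static double sumOfSquaredDistances(List<Integer> times, double candidate) {
        double total = 0.0;
        for (int t : times) {
            double d = distance(t, candidate);
            total += d * d;
        }
        return total;
    }

    /** Tries the n solutions of n x = sum (mod DAY) and keeps the one closest to the data. */
    static double circularMean(List<Integer> times) {
        int n = times.size();
        double sum = 0.0;
        for (int t : times) {
            sum += t;
        }
        double best = Double.NaN;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int k = 0; k < n; k++) {
            double candidate = ((sum + k * DAY) / n) % DAY;
            double cost = sumOfSquaredDistances(times, candidate);
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 23:00, 23:20 and 01:00 from the question
        double mean = circularMean(Arrays.asList(1380, 1400, 60));
        System.out.println("Mean in minutes since midnight: " + mean);
    }
}
```

For the three times from the question, the candidates are roughly 946.7, 1426.7 and 466.7 minutes, and the second one (about 23:46) has the smallest sum of squared distances, so it lies between 23:00 and 01:00 as requested.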

    Assume we have a list of LocalTimes:

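       import java.time.Duration;
       import java.time.LocalTime;
       import java.util.Arrays;
       import java.util.List;
       import java.util.concurrent.TimeUnit;
       import java.util.stream.Collectors;

       // The class name is illustrative; any enclosing class will do.
       public class TimeStatistics {
       private static final long            DAY      = TimeUnit.DAYS.toSeconds(1L);
       private static final double          HALF_DAY = DAY / 2.0;
       private static final List<LocalTime> times    = Arrays.asList(
             LocalTime.of(3, 45),
             LocalTime.of(1, 30),
             LocalTime.of(0, 0),
             LocalTime.of(23, 40),
             LocalTime.of(12, 10));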
    

    We can compute the average and the sum of squares in the "usual" determination (I do it in seconds, so the values lie between 0 and 86400):

       public static void printMeanVariance(final List<LocalTime> times) {
          final List<Double> dTimes = times.stream().mapToDouble(LocalTime::toSecondOfDay).boxed().collect(Collectors.toList());
          dTimes.sort(Double::compareTo);
          // A valid 'mean' must have max - HALF_DAY < mean < min + HALF_DAY
          double max = dTimes.get(dTimes.size() - 1);
          int count = 0;
          double sum = 0.0, sumOfSquares = 0.0;
          for (final Double time : dTimes) {
             count++;
             sum += time;
             sumOfSquares += time * time;
          }
          // to be continued...
    

    For the average of a given determination to be the "mean", it must satisfy two conditions:

    1. It must lie between max - HALF_DAY and min + HALF_DAY, where min and max are the minimal and maximal values in the current determination.
    2. The usual variance must be minimal.

    We check these conditions for all determinations, each time adding 86400 to the current minimal value:

          // continuation
          double average = -1;
          double sumOfDistancesSquared = Double.MAX_VALUE;
          for (final Double time : dTimes) {
             // Check whether the average of the current determination is admissible
             final double tmpAverage = sum / count;
             final double tmpSumOfDistancesSquared = sumOfSquares - sum * sum / count;
             if (max - HALF_DAY <= tmpAverage && tmpAverage <= time + HALF_DAY && tmpSumOfDistancesSquared < sumOfDistancesSquared) {
                average = tmpAverage;
                sumOfDistancesSquared = tmpSumOfDistancesSquared;
             }
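             // Move the current minimum one day forward: it becomes the new maximum
             sum += DAY;
             max = time + DAY;
             // (time + DAY)^2 - time^2 = DAY * (2 * time + DAY)
             sumOfSquares += DAY * (2 * time + DAY);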
          }
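          // average holds the "real" mean, possibly shifted by whole days,
          // so reduce it modulo DAY before converting it to a LocalTime
          double sd = Math.sqrt(sumOfDistancesSquared / (count - 1));
          System.out.println("Mean = " + LocalTime.ofSecondOfDay(Math.floorMod((long) average, DAY)) +
            ", deviation = " + Duration.ofSeconds((long) sd));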
       }
    }
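
To connect this back to the question's goal, here is a hedged sketch of how such a mean and deviation might be used to filter outliers. It works in minutes since midnight like the question, finds the mean by brute force over the n solutions of n x = sum instead of the incremental loop above, and uses the sample standard deviation (dividing by n - 1), so it assumes at least two times. All names are hypothetical, and the 1.5 factor is taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NonOutlierTimes {
    static final int DAY = 24 * 60; // minutes per day

    /** Shortest distance between two times of day, wrapping around midnight. */
    static double distance(double a, double b) {
        double diff = Math.abs(a - b) % DAY;
        return Math.min(diff, DAY - diff);
    }

    static double sumOfSquaredDistances(List<Integer> times, double center) {
        double total = 0.0;
        for (int t : times) {
            double d = distance(t, center);
            total += d * d;
        }
        return total;
    }

    /** Among the n solutions of n x = sum (mod DAY), picks the one closest to the data. */
    static double circularMean(List<Integer> times) {
        int n = times.size();
        double sum = 0.0;
        for (int t : times) {
            sum += t;
        }
        double best = Double.NaN;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int k = 0; k < n; k++) {
            double candidate = ((sum + (double) k * DAY) / n) % DAY;
            double cost = sumOfSquaredDistances(times, candidate);
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }

    /** Keeps the times whose circular distance from the mean is at most factor * sd. */
    static List<Integer> nonOutliers(List<Integer> times, double factor) {
        double mean = circularMean(times);
        double sd = Math.sqrt(sumOfSquaredDistances(times, mean) / (times.size() - 1));
        List<Integer> kept = new ArrayList<>();
        for (int t : times) {
            if (distance(t, mean) <= factor * sd) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> times = Arrays.asList(225, 90, 0, 1420, 730);
        System.out.println(nonOutliers(times, 1.5)); // prints [225, 90, 0, 1420]
    }
}
```

For the question's data the mean comes out at 205 minutes (03:25) with a deviation of about 309 minutes, so only 730 (12:10 pm) falls outside 1.5 deviations and is dropped.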