
Summary statistics in Scala


How can I elegantly calculate summary statistics (e.g. mean and variance) per group in Scala? In this example, each group is a distinct metric, identified by its name field:

case class MeasureUnit(name: String, value: Double)

Seq(MeasureUnit("metric1", 0.04), MeasureUnit("metric1", 0.09),
  MeasureUnit("metric2", 0.64), MeasureUnit("metric2", 0.34), MeasureUnit("metric2", 0.84))

An excellent example of how to calculate the mean and variance per property is https://chrisbissell.wordpress.com/2011/05/23/a-simple-but-very-flexible-statistics-library-in-scala/ but it does not cover grouping.


Solution

  • You can use Seq#groupBy

    val measureSeq : Seq[MeasureUnit] = ???
    
    type Name = String
    
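    // "metric1" -> Seq(0.04, 0.09), "metric2" -> Seq(0.64, 0.34, 0.84)
    // A strict map is used here because mapValues returns a lazy view
    // (not a Map) on Scala 2.13 and later.
    val groupedMeasures : Map[Name, Seq[Double]] = 
      measureSeq
        .groupBy(_.name)
        .map { case (name, measures) => name -> measures.map(_.value) }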
    

    The groupings can then be used to calculate your summary statistics (this assumes mean and variance functions of type Seq[Double] => Double are in scope):

    type Mean = Double
    
    val meanMapping : Map[Name, Mean] = 
      groupedMeasures.map { case (name, values) => name -> mean(values) }
    
    type Variance = Double
    
    val varianceMapping : Map[Name, Variance] = 
      groupedMeasures.map { case (name, values) => name -> variance(values) }
    

    Or you can map each name to a tuple of statistics:

    type Summary = (Mean, Variance)
    
    val summaryMapping : Map[Name, Summary] = 
      groupedMeasures.map { case (name, values) => (name, (mean(values), variance(values))) }
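
The snippets above leave mean and variance undefined, so here is a minimal self-contained sketch that puts the pieces together. The SummaryStats object, its summarize method, and the Demo object are hypothetical names introduced for illustration. The variance shown is the population variance (dividing by n), which is an assumption; divide by n - 1 instead if you need the sample variance. Both helpers assume non-empty input.

```scala
case class MeasureUnit(name: String, value: Double)

// Hypothetical helper object; not part of any library.
object SummaryStats {
  // Arithmetic mean; assumes xs is non-empty.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance (divides by n); assumes xs is non-empty.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Groups the measures by name and maps each name to (mean, variance).
  def summarize(measures: Seq[MeasureUnit]): Map[String, (Double, Double)] =
    measures
      .groupBy(_.name)
      .map { case (name, group) =>
        val values = group.map(_.value)
        (name, (mean(values), variance(values)))
      }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val data = Seq(
      MeasureUnit("metric1", 0.04), MeasureUnit("metric1", 0.09),
      MeasureUnit("metric2", 0.64), MeasureUnit("metric2", 0.34), MeasureUnit("metric2", 0.84)
    )
    // Iteration order of the resulting Map is not guaranteed.
    SummaryStats.summarize(data).foreach { case (name, (m, v)) =>
      println(f"$name: mean=$m%.4f variance=$v%.4f")
    }
  }
}
```

This version uses a plain map over the grouped entries rather than mapValues, so it should compile unchanged on Scala 2.12, 2.13 and 3. On Scala 2.13 or later, measures.groupMap(_.name)(_.value) is a shorter way to build the grouped values.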