
How to identify max value of one variable based on bins of a separate variable


I've seen a few binning questions, but haven't seen a solution to this case. Within each group_by group, I'm trying to identify the mode, but the mode should be weighted by the quantity of each observation (row), which is stored in another column.

Within my data, each row represents an observation at a given time; one column holds speed and another holds quantity. If I run statistics on the speed alone, the quantity recorded for each observation is ignored. The speed is a continuous variable, so I want to bin it (say 0-80 in increments of 5), sum the quantity in each bin, and finally report the speed bin with the highest total quantity (a value that will be used in a separate calculation).

The bin label would preferably be the midpoint (45-50 would be listed as 47.5). This would be run for each group of observations.

I've seen count(cut_width()), but that only counts observations, and I'm not sure how to find the bin with the maximum quantity. Thank you.
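
For reference, the binning described above can be sketched in base R roughly as follows. This is only a minimal sketch: the data frame `obs` and its values are made up for illustration, and the helper names `edges` and `mids` are hypothetical. Only the `Speed` and `Quantity` column names come from the data described here.

```r
# Hypothetical example data: one row per observation.
obs <- data.frame(
  Speed    = c(42, 47, 48, 52, 46, 61, 49),
  Quantity = c(3, 10, 12, 4, 8, 2, 9)
)

# Bin edges 0, 5, ..., 80 and the matching midpoints 2.5, 7.5, ..., 77.5.
edges <- seq(0, 80, by = 5)
mids  <- head(edges, -1) + 2.5

# Assign each speed to a bin labelled by its midpoint.
# right = FALSE gives bins like [45, 50), so a speed of exactly 50 lands in 50-55.
bin <- cut(obs$Speed, breaks = edges, labels = mids,
           right = FALSE, include.lowest = TRUE)

# Sum Quantity per bin (empty bins become NA, which which.max() skips),
# then take the midpoint of the bin with the largest total.
totals   <- tapply(obs$Quantity, bin, sum)
mode_bin <- as.numeric(names(which.max(totals)))
mode_bin
```

To get one value per group, the same steps could be wrapped in a function and applied to each group's rows.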


Solution

  • Some of my colleagues pointed me in a good direction on this, and I found more content online. One of the best approaches is to use a kernel density estimate (KDE) function that accepts weights, so the weights influence the estimated distribution. In my case, I weighted each speed observation by the number of vehicles observed with it (Quantity).

    That direction led me here: https://rmflight.github.io/post/finding-modes-using-kernel-density-estimates/

    That post shows a neat way to find the mode from a density estimate, so I only modified the density() call to add weights and set the bandwidth (bw), which controls the amount of smoothing and plays a role similar to a bin width.

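    # Normalize the weights so they sum to 1, which density() expects (it warns otherwise)
    density_estimate <- density(data.calc$Speed, weights = data.calc$Quantity / sum(data.calc$Quantity), bw = 1)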
    

    and then used the rest of the code from the linked post:

    mode_value <- density_estimate$x[which.max(density_estimate$y)]
    mode_value
    

    My data is evaluated in groups, so I placed this in a loop (which I know people don't love) and was able to evaluate the mode for different time intervals. Maybe this is all obvious, but I'm still learning and happy to find that this method works.
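
If you would rather avoid the explicit loop, one possible way is to wrap the weighted density step in a small function and apply it to each group. The following is a minimal sketch, not the exact code I ran: the helper `weighted_mode`, the grouping column `Interval`, and the sample values are all assumptions made up for illustration.

```r
# Weighted KDE mode for one group.
# bw is the kernel bandwidth (assumed to be 1 here, as in the call above).
weighted_mode <- function(speed, quantity, bw = 1) {
  w <- quantity / sum(quantity)        # density() expects weights that sum to 1
  d <- density(speed, weights = w, bw = bw)
  d$x[which.max(d$y)]                  # speed at the highest point of the curve
}

# Hypothetical data with an assumed grouping column named Interval.
data.calc <- data.frame(
  Interval = rep(c("AM", "PM"), each = 4),
  Speed    = c(45, 47, 48, 60, 30, 32, 55, 56),
  Quantity = c(5, 10, 8, 1, 2, 3, 9, 12)
)

# One weighted mode per group, returned as a named numeric vector.
modes <- sapply(split(data.calc, data.calc$Interval),
                function(g) weighted_mode(g$Speed, g$Quantity))
modes
```

Because the result is a smoothed estimate, the mode is a point on the density grid rather than a bin midpoint, and it will shift somewhat with the choice of bw.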