I need to read sorted array from input to awk/gawk and get median. I don't want to store the whole array and am trying to get constant space for the calculation.
Are you aware of any algorithm doing this? Given the array is sorted but its size is unknown.
Thank you in advance!
There is no algorithm to exactly find the median of a sorted sequence of unknown length that runs with a fixed amount of memory.
To see this, consider such an algorithm. Say it has a buffer of length N
for holding items from the sequence. Until this buffer is full, the algorithm simply puts items in it, tracking the median while it does so.
When the algorithm scans the N+1
th item and beyond, it must choose one item to drop at each step. Suppose it has already scanned 2N
items, dropping half of them. Let's give it the benefit of the doubt, and say it has not yet dropped median of the input stream.
Consider when it is scans the 2N+1
th item. Which item should it drop? It can't drop the least element it has kept so far, since input may be exhausted after this item, in which case the lowest may be the median. Likewise for any possible element it may drop, there is a future to the input sequence that makes this dropped element the median.
If you are willing to take approximate results, then this estimator may work for you.