I am trying to find the median of a stream of numbers under the following conditions:
The input is repeated 3 times; each repetition consists of n, the number of integers, followed by n integers a_i such that:
The format of the input data is as follows:
5
1 3 4 2 5
5
1 3 4 2 5
5
1 3 4 2 5
My code so far is as follows:
#ifdef STREAMING_JUDGE
#include "io.h"
#define next_token io.next_token
#else
#include<string>
#include<iostream>
using namespace std;
string next_token()
{
string s;
cin >> s;
return s;
}
#endif
#include<cstdio>
#include<cstdlib>
#include<vector>
#include<algorithm>
#include<iostream>
#include<cmath>
#include<ctime> // for time(), used to seed srand()
using namespace std;
int main()
{
srand(time(NULL));
//1st pass: randomly choose sqrt(n) numbers from the given stream of numbers
int n = atoi(next_token().c_str());
int p = (int)ceil(sqrt(n));
vector<int> a;
for(int i=0; i<n; i++)
{
int s=atoi(next_token().c_str());
if( rand()%p == 0 && (int)a.size() < p )
{
a.push_back(s);
}
}
sort(a.begin(), a.end());
//2nd pass: find k such that the median lies between a[k] and a[k+1], and find the rank of the median within that interval
next_token();
// One counter per sampled value; rank[j] will count the stream values <= a[j].
vector<int> rank(a.size(),0);
for( int i = 0; i < n; i++ )
{
int s=atoi(next_token().c_str());
for( int j = 0; j < (int)rank.size(); j++ )
{
if( s<=a[j] )
{
rank[j]++;
}
}
}
int median = 0;
int middle = (n+1)/2;
int k;
if( (int)a.size() == 1 && rank.front() == middle )
{
median=a.front();
cout << median << endl;
return 0;
}
for( int j = 0; j < (int)rank.size(); j++ )
{
if( rank[j] == middle )
{
// Exactly `middle` values are <= a[j], so a[j] itself is the median.
cout << a[j] << endl;
return 0;
}
else if( j+1 < (int)rank.size() && rank[j] < middle && rank[j+1] > middle )
{
k = j;
break;
}
}
//3rd pass: sort the numbers in (a[k], a[k+1]) to find the median
next_token();
vector<int> FinalRun;
if( rank.empty() )
{
for(int i=0; i<n; i++)
{
a.push_back(atoi(next_token().c_str()));
}
sort(a.begin(), a.end());
cout << a[n>>1] << endl;
return 0;
}
else if( rank.front() > middle )
{
for( int i = 0; i < n; i++ )
{
int s = atoi(next_token().c_str());
if( s < a.front() ) FinalRun.push_back(s);
}
sort( FinalRun.begin(), FinalRun.end() );
cout << FinalRun[middle-1] << endl;
return 0;
}
else if ( rank.back() < middle )
{
for( int i = 0; i < n; i++ )
{
int s = atoi(next_token().c_str());
if( s > a.back() ) FinalRun.push_back(s);
}
sort( FinalRun.begin(), FinalRun.end() );
cout << FinalRun[middle-rank.back()-1] << endl;
return 0;
}
else
{
for( int i = 0; i < n; i++ )
{
int s = atoi(next_token().c_str());
if( s > a[k] && s < a[k+1] ) FinalRun.push_back(s);
}
sort( FinalRun.begin(), FinalRun.end() );
cout << FinalRun[middle-rank[k]-1] << endl;
return 0;
}
}
But I still cannot reach O(n log n) time complexity. I suspect the bottleneck is the ranking part of the 2nd pass, i.e. finding the rank of the median within (a[k], a[k+1]) by computing the ranks of the sampled a[i]'s in the input stream. This part takes O(n sqrt(n)) time in my code.
I have no idea how to make the ranking more efficient. Are there any suggestions? Thanks in advance!
Further explanation of "rank": the rank of a sampled number is the count of numbers in the stream that are less than or equal to it. For instance, with the input given above, if a[0]=2, a[1]=4, and a[2]=5 are sampled, then rank[0]=2 because two numbers in the stream (1 and 2) are less than or equal to a[0].
Thanks for all of your help. @alexeykuzmin0's suggestion in particular does speed up the 2nd pass to O(n log n) time. One issue remains: in the 1st pass, I sample each number with probability 1/sqrt(n). If no number is sampled (the worst case), the vector a is empty and the following passes cannot be executed (a segmentation fault (core dumped) occurs). @Aconcagua, what do you mean by "select all remaining elements, if there aren't more than required any more"? Thanks.
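One possible reading of the quoted comment is selection sampling: take each element with probability (samples still needed) / (elements still to come). Once those two numbers are equal, every remaining element is taken, so the sample can never come up empty or short. The sketch below is only an illustration under that assumption; the helper name `sample_exactly` is hypothetical, it uses `<random>` instead of `rand()`, and it reads from a vector where the real program would call `next_token()` inside the loop.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not from the original post): picks exactly
// min(p, values.size()) elements and keeps them in stream order.
std::vector<int> sample_exactly(const std::vector<int>& values, std::size_t p,
                                std::mt19937& rng) {
    std::vector<int> sample;
    std::size_t needed = p < values.size() ? p : values.size();
    sample.reserve(needed);
    for (std::size_t i = 0; i < values.size() && needed > 0; ++i) {
        // Elements not yet decided, including the current one.
        std::size_t remaining = values.size() - i;
        // Take the element with probability needed / remaining.
        // When needed == remaining the draw always succeeds, so all
        // remaining elements are selected.
        std::uniform_int_distribution<std::size_t> pick(0, remaining - 1);
        if (pick(rng) < needed) {
            sample.push_back(values[i]);
            --needed;
        }
    }
    return sample;
}
```

With p = ceil(sqrt(n)) this always yields a non-empty sample for n >= 1, so the later passes never have to index into an empty vector.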
You're right, your second part runs in O(n√n) time:
for( int i = 0; i < n; i++ ) // <= n iterations
...
for( int j = 0; j < (int)rank.size(); j++ ) // <= √n iterations
To fix this, we need to get rid of the inner loop. For example, instead of directly counting the elements of the initial array that are less than each threshold, we could first count the elements of the array that fall into each interval:
// Outer loop as in your code; rank needs a.size() + 1 zero-initialized entries,
// because idx equals a.size() for values greater than every sample
for (int i = 0; i < n; ++i) {
    int s = atoi(next_token().c_str());
    // Find index of interval in O(log n) time
    int idx = std::upper_bound(a.begin(), a.end(), s) - a.begin();
    // Increase the count of only that interval
    ++rank[idx];
}
And then calculate the ranks of your threshold elements with a running sum:
std::partial_sum(rank.begin(), rank.end(), rank.begin()); // needs #include <numeric>
The resulting complexity is O(n log n) + O(n) = O(n log n).
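The two steps can be put together as a small standalone function. This is a sketch, not a drop-in replacement: the name `ranks_of_samples` is hypothetical and the stream is passed as a vector instead of being read with `next_token()`. It also swaps in `std::lower_bound` for `std::upper_bound`, because the question defines rank as the count of values less than or equal to the sample; with `std::upper_bound` the running totals count only the values strictly less than each sample.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: for each sorted sample a[j], returns the number of
// stream values that are <= a[j] (the "rank" as defined in the question).
std::vector<int> ranks_of_samples(const std::vector<int>& a,
                                  const std::vector<int>& stream) {
    // One bucket per sample plus an overflow bucket for values above a.back().
    std::vector<int> bucket(a.size() + 1, 0);
    for (int s : stream) {
        // Index of the first sample that is >= s; a.size() if there is none.
        std::size_t idx = std::lower_bound(a.begin(), a.end(), s) - a.begin();
        ++bucket[idx];
    }
    // Running totals: bucket[j] becomes the count of values <= a[j].
    std::partial_sum(bucket.begin(), bucket.end(), bucket.begin());
    bucket.pop_back();  // the last total is always stream.size(); drop it
    return bucket;
}
```

For the example in the question (samples 2, 4, 5 and stream 1 3 4 2 5) this gives ranks 2, 4, 5, which agrees with rank[0]=2 as stated there.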
Here I used two STL algorithms:
std::upper_bound, which finds the first element in a sorted array that is strictly greater than a given number, in logarithmic time, using binary search.
std::partial_sum, which calculates the partial sums of a given array in linear time.
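To see the two algorithms in isolation, here is a tiny sketch with two wrappers; the names `first_greater_index` and `running_totals` are made up for illustration and are not part of the STL.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical wrapper: index of the first element of `sorted` that is
// strictly greater than x, or sorted.size() if no such element exists.
int first_greater_index(const std::vector<int>& sorted, int x) {
    return static_cast<int>(
        std::upper_bound(sorted.begin(), sorted.end(), x) - sorted.begin());
}

// Hypothetical wrapper: running totals of the input,
// e.g. {1, 2, 1, 1} becomes {1, 3, 4, 5}.
std::vector<int> running_totals(std::vector<int> counts) {
    std::partial_sum(counts.begin(), counts.end(), counts.begin());
    return counts;
}
```

For the sorted samples 2, 4, 5, `first_greater_index` returns 0 for x = 1, 2 for x = 4, and 3 for x = 5 (past the end), which is why the counting array needs one more slot than there are samples.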