Let's take the traffic camera case as an example. Say I have a huge dataset of traffic camera records, which look like: plate_no | camera_id | city | location | direction | timestamp | etc | etc.
I'd like to get the results that meet all of the following:
The approach we are taking now is to index the data in SolrCloud, get three result sets (1&2, 1&3, 1&4), load them as RDDs with SparkSQL, and then use Spark to intersect the three result sets. In the end, you get the cars that showed up around George Street on Oct 1-3, Oct 4-5 and Oct 20. But the performance is not ideal: it takes a lot of time for Spark SQL to query Solr and form the RDDs. We have only tested with around 0.5 billion records, and such a query costs 1-2 minutes. It gets worse as we increase the dataset. We are looking for something around 30 seconds.
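Stripped of Solr and Spark, the core of the query is just a multi-window set intersection over the plates seen in each time window. A minimal sketch of that logic (the sample records, location and window bounds here are made up for illustration):

```python
from datetime import datetime

# Toy stand-in for the camera records: (plate_no, location, timestamp).
records = [
    ("ABC123", "George Street", datetime(2015, 10, 2)),
    ("ABC123", "George Street", datetime(2015, 10, 5)),
    ("ABC123", "George Street", datetime(2015, 10, 20)),
    ("XYZ789", "George Street", datetime(2015, 10, 2)),
    ("XYZ789", "George Street", datetime(2015, 10, 21)),
]

def plates_in_window(records, location, start, end):
    """Set of plates seen at `location` within [start, end]."""
    return {plate for plate, loc, ts in records
            if loc == location and start <= ts <= end}

# The three time windows from the example.
windows = [
    (datetime(2015, 10, 1), datetime(2015, 10, 3)),
    (datetime(2015, 10, 4), datetime(2015, 10, 5)),
    (datetime(2015, 10, 20), datetime(2015, 10, 20)),
]

# One result set per window, then intersect them all.
result_sets = [plates_in_window(records, "George Street", s, e)
               for s, e in windows]
suspects = set.intersection(*result_sets)
print(suspects)  # only ABC123 appears in all three windows
```

The hard part is not this logic but doing it fast over billions of rows, which is where the choice of storage and query engine matters.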
Just wondering what would be the right tool for this task, or whether there is a better approach than this.
Finally, we found a reasonable solution for this: we stored our Hive table in Parquet + Snappy format, then simply used SparkSQL to query the different time periods and fetch the different result sets (1&2, 1&3, 1&4). At the end, we did an intersect on the result sets.
We got rid of Solr completely in this case. The performance is quite sound in the end.
e.g. SparkSQL does support LIKE queries and INTERSECT.