I'm thinking about developing a system to perform highly parallel queries on nested (but tree-like) data. The potential users are data analysts (physicists, specifically), not programmers. For the user interface, I want to use a well-known query language to avoid proliferating new languages.
Most of the data would be structured like this (imagine the following schema for billions of event structures):
event: struct
|
+--- timestamp: bigint
+--- missing energy: float
+--- tracks: array of struct
| |
| +--- momentum: float
| +--- theta angle: float
| +--- hits: array of struct
| |
| +--- detector id: int
| +--- charge: float
| +--- time: float
| +--- ...
+--- showers: array of struct
|
+--- ...
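
To make the Scala snippets below concrete, here is one way that schema could look as case classes. This is only a sketch; the names are mine, and I use Double rather than Float for arithmetic convenience:

    // Sketch of the schema above as Scala case classes (names illustrative).
    case class Hit(detectorId: Int, charge: Double, time: Double /* ... */)
    case class Track(momentum: Double, theta: Double, hits: Seq[Hit])
    case class Shower(/* ... */)
    case class Event(timestamp: Long, missingEnergy: Double,
                     tracks: Seq[Track], showers: Seq[Shower])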
The database would be read-only, and most of the queries would be things like:
- the momentum of the track with the most hits, among tracks with |theta| < 2.4,
- the mean charge of all hits with time < 10 on tracks with momentum > 10,
- the momentum-weighted mean theta of the two highest-momentum tracks,

et cetera. What these queries have in common is that they all resolve to one scalar per event, though they delve into the arrays of structs to do it. They perform "reduce"-like operations (generally fold in Scala, aggregate in Spark, DAF in SQL) across filtered, transformed subsets of those arrays. I could write them in Scala like this:
    // missing a check for when zero tracks pass the filter!
    {event =>
      event.tracks                             // get list of tracks
        .filter(t => math.abs(t.theta) < 2.4)  // in theta range
        .maxBy(_.hits.size)                    // take the one with the most hits
        .momentum                              // return its momentum
    }
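
The zero-track gap flagged in the comment can be closed with Option; a minimal sketch using Scala 2.13's maxByOption, returning None when no track survives the cut:

    // Safe variant: None if no track passes the theta cut.
    {event: Event =>
      event.tracks
        .filter(t => math.abs(t.theta) < 2.4)
        .maxByOption(_.hits.size)   // Option instead of an exception
        .map(_.momentum)
    }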
    {event => mean(
      event.tracks                   // get list of tracks
        .filter(_.momentum > 10)     // in momentum range
        .flatMap(_.hits)             // explode to hits
        .filter(_.time < 10)         // in time range
        .map(_.charge)               // return their charges
    )}                               // ... to the mean function
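
mean isn't in the Scala standard library; for these sketches a one-line helper is enough (note that it assumes a non-empty collection):

    // Arithmetic mean; divides by zero on an empty collection, so guard upstream.
    def mean(xs: Seq[Double]): Double = xs.sum / xs.size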
    // again missing a check for fewer than two tracks!
    {event =>
      val Seq(one, two) =        // unpack and assign "one" and "two"
        event.tracks             // get list of tracks
          .sortBy(-_.momentum)   // sort by momentum, highest first
          .take(2)               // take the top two
      // now compute the weighted mean of structs "one" and "two"
      (one.theta*one.momentum + two.theta*two.momentum) /
        (one.momentum + two.momentum)
    }
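
The two-track assumption can likewise be guarded with a pattern match; a sketch:

    // Safe variant: None unless the event has at least two tracks.
    {event: Event =>
      event.tracks.sortBy(-_.momentum).take(2) match {
        case Seq(one, two) =>
          Some((one.theta*one.momentum + two.theta*two.momentum) /
               (one.momentum + two.momentum))
        case _ => None
      }
    }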
Why not just use Scala? My program is implemented in C and will run on GPUs. Whatever Scala I brought to it would be a reimplemented subset; in other words, an invented language. (The same could be said for Haskell, JavaScript, or any other language that makes heavy use of functions as arguments.)
Also, these queries ought to be declarative. If I implement too much of a general-purpose programming language, details like the order of function calls might become relevant.
Why not just use SQL? Can queries like the above be written easily enough to be readable by anyone other than the author? Queries like these are the norm, not complex extremes.
SQL supports nested arrays of structs, but all the examples I can find of using that substructure are horrendously complicated. One has to explode the table of events into a table of tracks (or double-explode to get hits), and some complex accounting would be needed to unexplode and get back to one scalar per event.
I suppose I could use SQL with new functions like MAXIMAL(collection, function), which would return one struct from an array, similar to track[12] but using the user-provided function as an objective for maximizing, minimizing, finding the top/bottom N, etc. I don't think SQL supports passing functions as arguments, though, and if I write an SQL that does, it would be non-standard.
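
To pin down what that primitive would have to do, here is its contract sketched in Scala; the name maximal and the signature are my invention, not an existing SQL feature:

    // Hypothetical MAXIMAL(collection, function): return the element for
    // which the user-supplied objective function is largest.
    def maximal[T, O: Ordering](collection: Seq[T], objective: T => O): T =
      collection.maxBy(objective)

    // e.g. the core step of the first example query would be:
    // maximal(tracksInThetaRange, (t: Track) => t.hits.size).momentum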
Is there a widely used dialect of SQL that supports passing functions as arguments?
Or is there another declarative language I should consider?
I posted this in a comment earlier, but I'm moving it here.
I'm with others on the use of a graph database. I'm not familiar with Neo4j queries, but I expect them to be capable. Similarly, SPARQL would work well for this kind of thing.
For the first query, a SPARQL query might look like:
    PREFIX : <http://yournamespace.com/accelerator/>

    SELECT ?momentum (MAX(?hitcount) AS ?maxhits)
    WHERE {
      {
        SELECT ?momentum (COUNT(?hit) AS ?hitcount)
        WHERE {
          ?track :momentum ?momentum .
          ?track :theta ?theta .
          FILTER (?theta > -2.4 && ?theta < 2.4)
          ?track :hits ?hit .
        }
        GROUP BY ?track ?momentum
      }
    }
    GROUP BY ?momentum
Identifiers have the : prefix on them because they need to be encoded as URIs. But that's an internal detail for moving to RDF (the data format for SPARQL databases).
The above query uses a sub-query because you're aggregating twice: once to count the hits, and again to take the maximum of those counts. But you can see that it's all handled in an SQL-like way, and does not require post-processing.