This table is on our impala cluster, although, the impala shell seems very similar to SQL. I have very little SQL or Impala experience, outside of doing extremely simple queries so this problem might be way above my experience and pay grade.
I need to get some data from our db using impala. There are other columns in the table but my command grabs these 3 columns I'm interested in. The data/result can look like this:
time | server | data
0 | A | 500
0 | B | 200
0 | C | 300
1 | A | 100
1 | A | 400
1 | B | 200
1 | C | 300
2 | A | 900
2 | B | 800
2 | C | 700
2 | C | 600
Ideally, the data should look like time 0, where each server only shows up once for each time value. However, there's a bug and as shown, a server can show up more than once for a time value. And it's not the same server that shows up more than once per time value either, as shown in the example above.
In any case, what impala-shell/SQL command do I read up/look up to tell it to select distinct time and server but to choose the larger of the two data values? I don't know how to proceed past:
select distinct time, server from table;
If this is too hard or complicated, perhaps someone can point me to a command that selects distinct time and server but prints the first value of data it finds for the distinct pair of time and server.
To get the unique time and server values along with the largest data value you could use a query with a GROUP BY
, as follows:
SELECT TIME, SERVER, MAX(DATA)
FROM TABLE
GROUP BY TIME, SERVER
ORDER BY TIME, SERVER;
Hope this helps.
Share and enjoy.