Search code examples
splunksplunk-query

Join 2 tables with multiple conditions


Assume if I type index=endpoints and hit enter in the search bar, I will get results that look something like this:

{
  "user": Jack,
  "os_name": "Windows",
  "hostname": "Windows-JACK-01",
  "pid": "30219",
  "app": "/usr/bin/curl",
  "cmdline": "curl google.com",
  "epoch": "1503452096",
  "type": "processes"
}

. . .

{
  "hostname": "Windows-JACK-01",
  "pid": "30219",
  "app": "/usr/bin/curl",
  "epoch": "1503452096",
  "ip": "123.123.123.123",
  "port": "1234",
  "type": "sockets"
}

. . .

There are two types of data under the same index - sockets and processes. I would like to find a way to combine these two types of data (that are related) so that I could get a richer data that has all the information.

+-------------------------+-----------------+---------+
| hostname | pid | app | osname | ip | port | etc.... |
+-------------------------+-----------------+---------+
| ...      | ... | ... | ...    | x  | y    | ...     |
+-------------------------+-----------------+---------+

The problem is if I simply do something like:

index=endpoints type="processes"
| join left=L right=L WHERE L.pid=R.pid [ search index=endpoints type="sockets" ]

Most of the time, I will get a wrong mapping between the app and the pid because any applications can get assigned to same pid when it's available.

I'm thinking maybe if I add more conditions, it will reduce the inaccuracy rate. For example, instead of L.pid=R.pid only, maybe I could do L.pid=R.pid AND L.hostname=R.hostname AND ...

My naive approach was to add more condition in the WHERE part

index=endpoints type="processes"
| join left=L right=L WHERE (L.pid=R.pid AND L.x=R.x AND...)
  [ search index=endpoints type="sockets" ]

However, it seems like that's not how it works. Any advice?


Solution

  • From your example queries I guess you are an experienced SQL user who is new to Splunk and hasn't read the manual about the join command. join does not accept a where clause nor does it have left or right options. As a best practice, one should avoid join as much as possible since it is very inefficient.

    Try using stats, instead. We use stats for its grouping feature rather than to calculate statistics.

    index=endpoints (type="processes" OR type="sockets")
    | stats values(*) as * by hostname, pid