Tags: json, jq, frequency-distribution

For a simple key-value pair list JSON, use jq to print a summary by range of values


Consider the following JSON object, which holds a list of key-value pairs:

{
  "session1": 128,
  "session2": 1048596,
  "session3": 3145728,
  "session4": 3145828,
  "session5": 11534338,
  "session6": 11544336,
  "session7": 2097252
}

Each key is a session identifier, and each value is the size in bytes of the data stored in that session.

I want to print counts of values by range, where each range includes its lower bound and excludes its upper bound: 0-1MB, 1-2MB, 2-3MB, ..., 12-13MB.

 1MB =  1048576
 2MB =  2097152
 3MB =  3145728
 4MB =  4194304
 5MB =  5242880
 6MB =  6291456
 7MB =  7340032
 8MB =  8388608
 9MB =  9437184
10MB = 10485760
11MB = 11534336
12MB = 12582912
13MB = 13631488
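
The boundaries listed above are multiples of 2^20 bytes, so the range a value falls into can be found with integer division. A minimal Python sketch of that arithmetic follows (Python is used here only for illustration, not jq; the names `MB`, `boundaries` and `bucket_label` are made up for this example, and values are assumed to be non-negative integers):

```python
MB = 1 << 20  # 1048576 bytes, matching the table above

# Boundaries for 1MB through 13MB.
boundaries = [n * MB for n in range(1, 14)]

def bucket_label(size):
    """Return the "X-YMB" range containing size (lower bound included, upper excluded)."""
    n = size // MB  # integer division rounds down to whole megabytes
    return f"{n}-{n + 1}MB"

print(bucket_label(128))      # well below 1MB
print(bucket_label(3145728))  # exactly 3MB, so it belongs to 3-4MB, not 2-3MB
```

Because the lower bound is included, a value that sits exactly on a boundary is counted in the range that starts there.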

The expected output is:

{
  "0-1MB": 1,
  "1-2MB": 1,
  "2-3MB": 1,
  "3-4MB": 2,
  "10-11MB": 2
}

The above is just representative; suggestions are welcome.


Solution

  • The following should work:

    to_entries
    | map(.value / 1048576 | floor | [tostring, "-", (.+1 | tostring), "MB"] | add)
    | group_by(.)
    | map({"key": .[0], "value": length})
    | from_entries
    

    For your input, it produces the following output:

    {
      "0-1MB": 1,
      "1-2MB": 1,
      "11-12MB": 2,
      "2-3MB": 1,
      "3-4MB": 2
    }
    

    (11534338 and 11544336 are counted in the "11-12MB" bucket rather than the "10-11MB" one, because 11*2^20 = 11534336, and those numbers are larger than that.)

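    The keys above are in string order (so "11-12MB" comes before "2-3MB") because group_by sorts the labels as strings. If you want the keys in numeric order, group by the numeric bucket first and build your preferred string labels after the group_by: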

    to_entries
    | map(.value / 1048576 | floor)
    | group_by(.)
    | map({"key": [(.[0] | tostring), "-", (.[0]+1 | tostring), "MB"] | add, "value": length})
    | from_entries
    

    Which produces:

    {
      "0-1MB": 1,
      "1-2MB": 1,
      "2-3MB": 1,
      "3-4MB": 2,
      "11-12MB": 2
    }
    

    Both solutions have the same basic steps:

    1. Convert the input object to an array of {"key": x, "value": y} entries (to_entries).
    2. Map the entries into something that identifies the range they're in, by rounding down to the nearest megabyte (.value / 1048576 | floor).
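    3. Group by the value (group_by). In the second solution, this produces an array like [[0], [1], [2], [3, 3], [11, 11]] for your input; in the first solution, the groups hold the label strings instead of the numbers.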
    4. For each group, produce an entry where the "key" field is the range label ("X-YMB") and the "value" is the number of elements in the group (length).
    5. Convert the list of entries back to a single object (from_entries).
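
The five steps above can also be cross-checked outside jq. The following is a minimal Python sketch of the same pipeline, not part of the jq solution; the function name `summarize` is made up for this example, and it assumes every value is a non-negative integer.

```python
import json
from itertools import groupby

MB = 1048576  # bytes per megabyte, the same divisor as in the jq filters

def summarize(sessions):
    # Steps 1-2: take each value and round it down to whole megabytes.
    buckets = sorted(size // MB for size in sessions.values())
    # Steps 3-5: group equal bucket numbers, count each group, and label it.
    # groupby only merges adjacent items, which is why the list is sorted first.
    return {f"{n}-{n + 1}MB": len(list(group)) for n, group in groupby(buckets)}

sessions = {
    "session1": 128,
    "session2": 1048596,
    "session3": 3145728,
    "session4": 3145828,
    "session5": 11534338,
    "session6": 11544336,
    "session7": 2097252,
}

print(json.dumps(summarize(sessions), indent=2))
```

For the sample input this should give the same counts as the second jq filter, with the keys in numeric order, since the buckets are sorted as numbers before the labels are built.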