
Find the longest group after groupby on normalized json in pandas


My code below groups by actor.profileId and builds, for each id, a list of the per-event parameter counts (the lengths of the arrays). How can I instead return the id whose list has the largest sum?

Original JSON read into df (not the same data as the printed output, which was too long):

{
   "kind":"admin#reports#activities",
   "etag":"\"5g8\"",
   "nextPageToken":"A:1651795128914034:-4002873813067783265:151219070090:C02f6wppb",
   "items":[
      {
         "kind":"admin#reports#activity",
         "id":{
            "time":"2022-05-05T23:59:39.421Z",
            "uniqueQualifier":"5526793068617678141",
            "applicationName":"token",
            "customerId":"cds"
         },
         "etag":"\"jkYcURYoi8\"",
         "actor":{
            "email":"blah@blah.net",
            "profileId":"1323"
         },
         "ipAddress":"107.178.193.87",
         "events":[
            {
               "type":"auth",
               "name":"activity",
               "parameters":[
                  {
                     "name":"api_name",
                     "value":"admin"
                  },
                  {
                     "name":"method_name",
                     "value":"directory.users.list"
                  },
                  {
                     "name":"client_id",
                     "value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
                  },
                  {
                     "name":"num_response_bytes",
                     "intValue":"7158"
                  },
                  {
                     "name":"product_bucket",
                     "value":"GSUITE_ADMIN"
                  },
                  {
                     "name":"app_name",
                     "value":"Untitled project"
                  },
                  {
                     "name":"client_type",
                     "value":"WEB"
                  }
               ]
            }
         ]
      },
      {
         "kind":"admin#reports#activity",
         "id":{
            "time":"2022-05-05T23:58:48.914Z",
            "uniqueQualifier":"-4002873813067783265",
            "applicationName":"token",
            "customerId":"df"
         },
         "etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
         "actor":{
            "email":"blah.blah@bebe.net",
            "profileId":"1324"
         },
         "ipAddress":"54.80.168.30",
         "events":[
            {
               "type":"auth",
               "name":"activity",
               "parameters":[
                  {
                     "name":"api_name",
                     "value":"gmail"
                  },
                  {
                     "name":"method_name",
                     "value":"gmail.users.messages.list"
                  },
                  {
                     "name":"client_id",
                     "value":"927538837578.apps.googleusercontent.com"
                  },
                  {
                     "name":"num_response_bytes",
                     "intValue":"2"
                  },
                  {
                     "name":"product_bucket",
                     "value":"GMAIL"
                  },
                  
                  {
                     "name":"client_type",
                     "value":"WEB"
                  }
               ]
            }
         ]
      }
   ]
}

current code:

df = pd.json_normalize(response['items'])
df['test'] = df.groupby('actor.profileId')['events'].apply(
    lambda x: [len(x.iloc[i][0]['parameters']) for i in range(len(x))]
)

output:

ID
1002306    [7, 7, 7, 5]
1234444    [3, 5, 6]
1222222    [1, 3, 4, 5]

desired output

id       total
1002306  26



Solution

  • There’s no need to construct the intermediate df and run groupby on it. You can pass the record and meta paths to json_normalize to flatten the JSON directly, so that each parameter becomes one row. The job then reduces to counting the number of rows per actor.profileId and taking the maximum.

    # One row per parameter; carry the actor dict along as metadata
    df = pd.json_normalize(response['items'], ['events','parameters'], ['actor'])
    # Extract profileId from the actor dicts
    df['actor.profileId'] = df['actor'].str['profileId']
    # value_counts sorts descending, so the first row is the top actor
    out = df.value_counts('actor.profileId').pipe(lambda x: x.iloc[[0]])
    

    Output:

    actor.profileId
    1323    7
    dtype: int64