python json pandas dataframe reverse-lookup

Reverse lookup in a json string with key value pairs

I have a Pandas DataFrame, from a source that I can't change, in the following format:

import pandas as pd

json_str_01 = '''[{"my_key": 1, "my_values": ["3_13"]}, {"my_key": 2, "my_values": ["3_18", "2_12"]}, {"my_key": 3, "my_values": ["2_10", "2_17"]}]'''
json_str_02 = '''[{"my_key": 1, "my_values": ["2_17"]}, {"my_key": 2, "my_values": ["2_12", "3_16"]}, {"my_key": 3, "my_values": ["3_13", "2_8"]}, {"my_key": 4, "my_values": ["3_18", "3_17"]}]'''

data = {
  "search_str": ["3_13", "2_8"],
  "json_data_str": [json_str_01, json_str_02]
}

df = pd.DataFrame(data)

Jupyter notebook snapshot of the DataFrame:

Each row has two columns, one is a string and the other is likely Json (I am not explicitly told that it is Json, but it can be parsed using Json libraries).

The Json cell can have variable number of key-value pairs.
Value is a list with one or more string items.

The search_str column corresponds to at least one of the string items. The task is to pull the corresponding key in a new column. To avoid any conflict in the future, the solution should return the first key in case of multiple matches.

The following code is giving the right results. I want to have a vectorized/optimized solution where I need expert input.

def json_reverse_lookup(search_str, json_data_str):
    df_json_data = pd.read_json(json_data_str)

    for idx, row in df_json_data.iterrows():
        if search_str in row[1]:
            return row[0]
    
    return None

df['result'] = df.apply(lambda x: json_reverse_lookup(x['search_str'], x['json_data_str']), axis=1)
print(df)

Solution

You cannot vectorize this operation in an efficient way. Your best shot is most likely a python loop.

You could improve your code avoiding all pandas functions:

import json

def search_in_json(search_str, json_data_str):
    return next((d['my_key'] for d in json.loads(json_data_str)
                 for v in d['my_values'] if v == search_str), None)

df['result'] = [search_in_json(s, j) for s, j in
                zip(df['search_str'], df['json_data_str'])]

Output:

  search_str                                      json_data_str  result
0       3_13  [{"my_key": 1, "my_values": ["3_13"]}, {"my_ke...       1
1        2_8  [{"my_key": 1, "my_values": ["2_17"]}, {"my_ke...       3