Search code examples
pythonjsonpandasdataframereverse-lookup

Reverse lookup in a json string with key value pairs


I have a Pandas DataFrame, from a source that I can't change, in the following format:

import pandas as pd

json_str_01 = '''[{"my_key": 1, "my_values": ["3_13"]}, {"my_key": 2, "my_values": ["3_18", "2_12"]}, {"my_key": 3, "my_values": ["2_10", "2_17"]}]'''
json_str_02 = '''[{"my_key": 1, "my_values": ["2_17"]}, {"my_key": 2, "my_values": ["2_12", "3_16"]}, {"my_key": 3, "my_values": ["3_13", "2_8"]}, {"my_key": 4, "my_values": ["3_18", "3_17"]}]'''

data = {
  "search_str": ["3_13", "2_8"],
  "json_data_str": [json_str_01, json_str_02]
}

df = pd.DataFrame(data)

Jupyter notebook snapshot of the DataFrame:

enter image description here

Each row has two columns, one is a string and the other is likely Json (I am not explicitly told that it is Json, but it can be parsed using Json libraries).

  • The Json cell can have variable number of key-value pairs.
  • Value is a list with one or more string items.

The search_str column corresponds to at least one of the string items. The task is to pull the corresponding key in a new column. To avoid any conflict in the future, the solution should return the first key in case of multiple matches.

The following code is giving the right results. I want to have a vectorized/optimized solution where I need expert input.

def json_reverse_lookup(search_str, json_data_str):
    df_json_data = pd.read_json(json_data_str)

    for idx, row in df_json_data.iterrows():
        if search_str in row[1]:
            return row[0]
    
    return None

df['result'] = df.apply(lambda x: json_reverse_lookup(x['search_str'], x['json_data_str']), axis=1)
print(df)

Solution

  • You cannot vectorize this operation in an efficient way. Your best shot is most likely a python loop.

    You could improve your code avoiding all pandas functions:

    import json
    
    def search_in_json(search_str, json_data_str):
        return next((d['my_key'] for d in json.loads(json_data_str)
                     for v in d['my_values'] if v == search_str), None)
    
    df['result'] = [search_in_json(s, j) for s, j in
                    zip(df['search_str'], df['json_data_str'])]
    

    Output:

      search_str                                      json_data_str  result
    0       3_13  [{"my_key": 1, "my_values": ["3_13"]}, {"my_ke...       1
    1        2_8  [{"my_key": 1, "my_values": ["2_17"]}, {"my_ke...       3