I have a Pandas DataFrame, from a source that I can't change, in the following format:
import pandas as pd
json_str_01 = '''[{"my_key": 1, "my_values": ["3_13"]}, {"my_key": 2, "my_values": ["3_18", "2_12"]}, {"my_key": 3, "my_values": ["2_10", "2_17"]}]'''
json_str_02 = '''[{"my_key": 1, "my_values": ["2_17"]}, {"my_key": 2, "my_values": ["2_12", "3_16"]}, {"my_key": 3, "my_values": ["3_13", "2_8"]}, {"my_key": 4, "my_values": ["3_18", "3_17"]}]'''
data = {
"search_str": ["3_13", "2_8"],
"json_data_str": [json_str_01, json_str_02]
}
df = pd.DataFrame(data)
Jupyter notebook snapshot of the DataFrame:
Each row has two columns, one is a string and the other is likely Json (I am not explicitly told that it is Json, but it can be parsed using Json libraries).
The search_str column corresponds to at least one of the string items. The task is to pull the corresponding key in a new column. To avoid any conflict in the future, the solution should return the first key in case of multiple matches.
The following code is giving the right results. I want to have a vectorized/optimized solution where I need expert input.
def json_reverse_lookup(search_str, json_data_str):
df_json_data = pd.read_json(json_data_str)
for idx, row in df_json_data.iterrows():
if search_str in row[1]:
return row[0]
return None
df['result'] = df.apply(lambda x: json_reverse_lookup(x['search_str'], x['json_data_str']), axis=1)
print(df)
You cannot vectorize this operation in an efficient way. Your best shot is most likely a python loop.
You could improve your code avoiding all pandas functions:
import json
def search_in_json(search_str, json_data_str):
return next((d['my_key'] for d in json.loads(json_data_str)
for v in d['my_values'] if v == search_str), None)
df['result'] = [search_in_json(s, j) for s, j in
zip(df['search_str'], df['json_data_str'])]
Output:
search_str json_data_str result
0 3_13 [{"my_key": 1, "my_values": ["3_13"]}, {"my_ke... 1
1 2_8 [{"my_key": 1, "my_values": ["2_17"]}, {"my_ke... 3