I have some data (shown below), which is a dictionary of nested lists of dictionaries. I want to make the whole dictionary into one row. A very wide row. At present I can get my desired result. It is quite long and not very elegant. I'd like to get better at writing more concise code, and hopefully more computationally efficient code.
For context I am doing this for a university project. We aren't getting marked on code elegance or optimisation. This is more for my own benefit.
Being still quite new to python I am not great at working with a sort of nested dictionary of dictionaries and lists of nested dictionaries. Any help would be much appreciated!
Data dictionary
{'attack': [{'stat': 'carries_crossed_gain_line', 'value': '52'},
{'stat': 'carries_metres', 'value': '648'},
{'stat': 'carries_not_made_gain_line', 'value': '64'},
{'stat': 'clean_breaks', 'value': '21'},
{'stat': 'defenders_beaten', 'value': '25'},
{'stat': 'offload', 'value': '16'},
{'stat': 'passes', 'value': '168'},
{'stat': 'runs', 'value': '138'},
{'stat': 'turnovers_conceded', 'value': '16'}],
'defence': [{'stat': 'missed_tackles', 'value': '32'},
{'stat': 'tackles', 'value': '125'},
{'stat': 'turnovers_won', 'value': '6'}],
'discipline': [{'stat': 'penalties_conceded', 'value': '12'},
{'stat': 'red_card_second_yellow', 'value': '0'},
{'stat': 'red_cards', 'value': '0'},
{'stat': 'yellow_cards', 'value': '0'}],
'kicking': [{'stat': 'conversion_goals', 'value': '3'},
{'stat': 'kicks_from_hand', 'value': '10'},
{'stat': 'missed_conversion_goals', 'value': '1'},
{'stat': 'missed_penalty_goals', 'value': '0'},
{'stat': 'penalty_goals', 'value': '2'}],
'breakdown': [{'stat': 'mauls_lost', 'value': '0'},
{'stat': 'mauls_total', 'value': '6'},
{'stat': 'mauls_won', 'value': '6'},
{'stat': 'mauls_won_penalty', 'value': '1'},
{'stat': 'mauls_won_try', 'value': '0'},
{'stat': 'rucks_lost', 'value': '11'},
{'stat': 'rucks_total', 'value': '99'},
{'stat': 'rucks_won', 'value': '88'}],
'lineouts': [{'stat': 'lineout_success', 'value': '0.93'},
{'stat': 'lineout_won_steal', 'value': '1'},
{'stat': 'lineouts_Lost', 'value': '1'},
{'stat': 'lineouts_won', 'value': '14'}],
'scrums': [{'stat': 'scrums_lost', 'value': '0'},
{'stat': 'scrums_success', 'value': '1.00'},
{'stat': 'scrums_won', 'value': '2'}],
'possession': [{'stat': 'possession', 'value': '0.50'},
{'stat': 'pc_possession_first', 'value': '0.50'},
{'stat': 'pc_possession_second', 'value': '0.50'},
{'stat': 'ball_possession_last_10_mins', 'value': '0.61'}]}
Desired outcome a wide row like:
What I've tried I made a function that I can use to call each nested DF like so:
def trim(df):
x = df.transpose()
x.columns = x.iloc[0]
x = x[1:]
return x
Which I call in this manner. altering the last 'attack' to the other stats:
trim(pd.DataFrame(results['home']['team_stats']['attack']))
Then I
This is time consuming. Is there a way that I can do this in a more efficient or pythonic manner? My end goal is to collect this match data for the home and away teams (therefore doubling the data frame width) and making one row per game.
Extra data for reference Results: I am only interest in the ['match'], ['home'] and ['away'] dictionaries. Within the home and away dictionaries, the 'team_stats' part is what I am trying to extract. This data comes from an API. I turn the response into json format first. I will be processing one set of data like this per match. And will be aiming for around 250+ matches.
Try this:
# Assuming `rugby_data`` is the deserialized JSON object from your sample input
df = (
pd.DataFrame(
# The idea is to convert it to a list of 3-tuples:
# [
# ("match", "id", 3195835),
# ("match", "comp_id", 2142),
# ...
# ("home", "carries_crossed_gain_line", "52"),
# ("home", "carries_metres", "648"),
# ...
# ("away", "carries_crossed_gain_line", "59"),
# ("away", "carries_metres", "561"),
# ...
# ]
[("match", key, value) for key, value in rugby_data["match"].items()]
+ [
(side, stat["stat"], stat["value"])
for side in ["home", "away"]
for _, stats in rugby_data[side]["team_stats"].items()
for stat in stats
],
columns=["category", "key", "value"],
)
# Then transform it to a wide format
.set_index(["category", "key"])
.T
)
# Your stats are all strings. Convert them to floats for numerical operations
# later on
for (category, key) in df.columns:
if category in ["home", "away"]:
df[category, key] = df[category, key].astype(float)
# I would recommend using a MultiIndex for the columns:
# df["match", "venue"]
# df["home", "carries_crossed_gain_line"]
# df["away", "conversion_goals"]
# but if you want to, you can flatten the column levels:
df.columns = df.columns.map("_".join)
# Now you can refer to columns like:
# df["match_venue"]
# df["home_carries_crossed_gain_line"]
# df["away_conversion_goals"]