Search code examples
pythonpandasdictionarydata-wrangling

converting a dictionary of nested lists into a row in a data frame


I have some data (shown below), which is a dictionary of nested lists of dictionaries. I want to make the whole dictionary into one row. A very wide row. At present I can get my desired result. It is quite long and not very elegant. I'd like to get better at writing more concise code, and hopefully more computationally efficient code.

For context I am doing this for a university project. We aren't getting marked on code elegance or optimisation. This is more for my own benefit.

Being still quite new to python I am not great at working with a sort of nested dictionary of dictionaries and lists of nested dictionaries. Any help would be much appreciated!

Data dictionary

{'attack': [{'stat': 'carries_crossed_gain_line', 'value': '52'},
  {'stat': 'carries_metres', 'value': '648'},
  {'stat': 'carries_not_made_gain_line', 'value': '64'},
  {'stat': 'clean_breaks', 'value': '21'},
  {'stat': 'defenders_beaten', 'value': '25'},
  {'stat': 'offload', 'value': '16'},
  {'stat': 'passes', 'value': '168'},
  {'stat': 'runs', 'value': '138'},
  {'stat': 'turnovers_conceded', 'value': '16'}],
 'defence': [{'stat': 'missed_tackles', 'value': '32'},
  {'stat': 'tackles', 'value': '125'},
  {'stat': 'turnovers_won', 'value': '6'}],
 'discipline': [{'stat': 'penalties_conceded', 'value': '12'},
  {'stat': 'red_card_second_yellow', 'value': '0'},
  {'stat': 'red_cards', 'value': '0'},
  {'stat': 'yellow_cards', 'value': '0'}],
 'kicking': [{'stat': 'conversion_goals', 'value': '3'},
  {'stat': 'kicks_from_hand', 'value': '10'},
  {'stat': 'missed_conversion_goals', 'value': '1'},
  {'stat': 'missed_penalty_goals', 'value': '0'},
  {'stat': 'penalty_goals', 'value': '2'}],
 'breakdown': [{'stat': 'mauls_lost', 'value': '0'},
  {'stat': 'mauls_total', 'value': '6'},
  {'stat': 'mauls_won', 'value': '6'},
  {'stat': 'mauls_won_penalty', 'value': '1'},
  {'stat': 'mauls_won_try', 'value': '0'},
  {'stat': 'rucks_lost', 'value': '11'},
  {'stat': 'rucks_total', 'value': '99'},
  {'stat': 'rucks_won', 'value': '88'}],
 'lineouts': [{'stat': 'lineout_success', 'value': '0.93'},
  {'stat': 'lineout_won_steal', 'value': '1'},
  {'stat': 'lineouts_Lost', 'value': '1'},
  {'stat': 'lineouts_won', 'value': '14'}],
 'scrums': [{'stat': 'scrums_lost', 'value': '0'},
  {'stat': 'scrums_success', 'value': '1.00'},
  {'stat': 'scrums_won', 'value': '2'}],
 'possession': [{'stat': 'possession', 'value': '0.50'},
  {'stat': 'pc_possession_first', 'value': '0.50'},
  {'stat': 'pc_possession_second', 'value': '0.50'},
  {'stat': 'ball_possession_last_10_mins', 'value': '0.61'}]}

Desired outcome a wide row like: enter image description here

What I've tried I made a function that I can use to call each nested DF like so:

def trim(df):
    x = df.transpose()
    x.columns = x.iloc[0]
    x = x[1:]
    
    return x

Which I call in this manner. altering the last 'attack' to the other stats:

trim(pd.DataFrame(results['home']['team_stats']['attack']))

Then I

  1. concatenate the frames into one row.
  2. Add in the 'home' prefix to all the columns.
  3. Repeat steps one and two for the away team, adding the prefix 'away' to the columns
  4. concatenate both results to a very, very wide row.
  5. Add the match details meta data to the row, making it very, very, very wide.
  6. Repeat this for every set of match data.

This is time consuming. Is there a way that I can do this in a more efficient or pythonic manner? My end goal is to collect this match data for the home and away teams (therefore doubling the data frame width) and making one row per game.

Extra data for reference Results: I am only interest in the ['match'], ['home'] and ['away'] dictionaries. Within the home and away dictionaries, the 'team_stats' part is what I am trying to extract. This data comes from an API. I turn the response into json format first. I will be processing one set of data like this per match. And will be aiming for around 250+ matches.

data


Solution

  • Try this:

    # Assuming `rugby_data`` is the deserialized JSON object from your sample input
    df = (
        pd.DataFrame(
            # The idea is to convert it to a list of 3-tuples:
            #   [
            #       ("match", "id", 3195835),
            #       ("match", "comp_id", 2142),
            #       ...
            #       ("home", "carries_crossed_gain_line", "52"),
            #       ("home", "carries_metres", "648"),
            #       ...
            #       ("away", "carries_crossed_gain_line", "59"),
            #       ("away", "carries_metres", "561"),
            #       ...
            #   ]
            [("match", key, value) for key, value in rugby_data["match"].items()]
            + [
                (side, stat["stat"], stat["value"])
                for side in ["home", "away"]
                for _, stats in rugby_data[side]["team_stats"].items()
                for stat in stats
            ],
            columns=["category", "key", "value"],
        )
        # Then transform it to a wide format
        .set_index(["category", "key"])
        .T
    )
    
    # Your stats are all strings. Convert them to floats for numerical operations
    # later on
    for (category, key) in df.columns:
        if category in ["home", "away"]:
            df[category, key] = df[category, key].astype(float)
    
    # I would recommend using a MultiIndex for the columns:
    #   df["match", "venue"]
    #   df["home", "carries_crossed_gain_line"]
    #   df["away", "conversion_goals"]
    # but if you want to, you can flatten the column levels:
    df.columns = df.columns.map("_".join)
    
    # Now you can refer to columns like:
    #   df["match_venue"]
    #   df["home_carries_crossed_gain_line"]
    #   df["away_conversion_goals"]