Search code examples
pythonjsonpandasdictionaryjson-normalize

Unstacking json dictionaries in pandas


I've got one for ya.

So I'm trying to unpack some proprietary data I'm getting from an api.

for a reproducible example once I unpack the json data, I get a dictionary that looks like this

temp = ([{"date" : "12/15/2020","order_id" : 1, "order_items" : [{"name" : "sponge", "quantity" : 2},{"name" : "soap", "quantity" : 17}]},
 {"date" : "12/14/2020","order_id" : 2, "order_items" : [{"name" : "soap", "quantity" : 4}]}]
)

I then make a dataframe using this code

df = pd.json_normalize(temp)

now what this gives me is a dataframe that looks kind of like this.

what_i_have = pd.DataFrame({
"date" : ["12/15/2020","12/14/2020"],
"order_id" : [1,2],
"order_items" : [[{'name' : 'sponge', 'quantity' : 2},{'name' : 'soap', 'quantity' : 17}],[{'name' : 'sponge', 'quantity' : 4}]]
})

Now, I see that the problem is that when I used json_normalize it didn't go down enough levels. If I do something like

pd.json_normalize(df['order_items'][0])

it returns to me a 2 row, 2 column dataframe. if I do

df['order_items'] = df['order_items'].apply(lambda x: pd.json_normalize(x))

I get a dataframe that has dataframe objects in the order_items column, that I can't quite figure out how to use.

What I want to do is to unpack the tables I make at the lower level, and make my dataframe longer. I want it to look like this

what_i_want = pd.DataFrame({
"date" : ["12/15/2020","12/15/2020","12/14/2020"],
"order_id" : [1,1,2],
"order_items.name" : ["sponge","soap","soap"],
"order_items.quantity" : [2,17,4]
})

Any suggestions on how to do that?

ps. I think the reason that json_normalize doesn't go down enough levels is because the order_items has a varying length.


Solution

  • import pandas as pd
    
    # sample data
    temp = [{'date': '12/15/2020', 'order_id': 1, 'order_items': [{'name': 'sponge', 'quantity': 2}, {'name': 'soap', 'quantity': 17}]}, {'date': '12/14/2020', 'order_id': 2, 'order_items': [{'name': 'soap', 'quantity': 4}]}]
    
    # unpack temp using the other parameters
    df = pd.json_normalize(data=temp, record_path=['order_items'], meta=['date', 'order_id'])
    
    # display(df)
         name  quantity        date order_id
    0  sponge         2  12/15/2020        1
    1    soap        17  12/15/2020        1
    2    soap         4  12/14/2020        2