Search code examples
pythonpandasrelational-database

Unpacking nested JSON data from relational database in pandas


this is my first question, so I hope I provide all the necessary details.

My Problem

I'm fairly new to handling data with python, and I fight with the conversion of a specific type of dataframe. I'm accessing data from a relational database (Baserow) via an API. The output includes (among other data types) relation data of dictionaries within lists, like so:

[{'id': 1207327, 'value': 'foo'}, {'id': 1207330, 'value': 'bar'}]

In the respective columns, some cells are filled like this, others contain empty lists like so: [].

So, the dataframe looks like this (only much bigger):

test_df = pd.DataFrame({'ID' : [1,2],
        'PROPERTY ONE': [
            [{'id': 1207327, 'value': 'foo'}, {'id': 1207330, 'value': 'bar'}],
            []
        ]})

Output:

   ID                                       PROPERTY ONE
0   1  [{'id': 1207327, 'value': 'foo'}, {'id': 12073...
1   2                                                 []

I am looking for a way to unpack this efficiently with pandas into fields containing plain lists or strings of value with some separator, e.g.

`[foo, bar]`

Question

I have build a solution with two for-loops, but this looks inefficient and clumsy. Is there any way to achieve this in a vectorized way?


What I have tried:

My for-loop-solution looks like this:

  1. over all columns to find the respective columns
  2. within each column over individual rows/cells, unpacking the lists to individual data to dataframes and converting the respective columns to lists.

The second loop looks like this:

new_column = []
    for array in column:
        tmp_dfr = pd.DataFrame.from_dict(array)
        length = tmp_dfr.empty
        if not tmp_dfr.empty:
            values = tmp_dfr['value'].tolist()
        else:
            values = []
        new_column.append(values)
    new_column = pd.Series(new_column)

As described above, I am not happy with the way this is achieved, as it is faster and more efficient to work in a vectorized way, as I understand.

I have also looked into several seemingly related questions on this site (e.g. [here](pandas DataFrame: normalize one JSON column and merge with other columns or here). But as I am at the moment just fumbling my way into pandas, I am not able to transfer the solutions to my problem. Instead, I am running into all kinds of exceptions, maybe because some of the cells contain empty lists...


Solution

  • Creating many extra dataframes is certainly inefficient because it creates extra overhead. You can just use a list comprehension instead (the simplify function). That way there's also no need to special-case the empty list. I would use pandas.Series.map for converting the elements. Using .map instead of some form of iteration makes clear that the operation doesn't depend on results from higher up in the table.

    def simplify(json_list):
        return [json_object["value"] for json_object in json_list]
    
    test_df['NEW COLUMN'] = test_df['PROPERTY ONE'].map(simplify)
    

    The output is:

       ID                                       PROPERTY ONE  NEW COLUMN
    0   1  [{'id': 1207327, 'value': 'foo'}, {'id': 12073...  [foo, bar]
    1   2                                                 []          []
    

    If you're a fan of brevity, you can use a lambda function instead of the named function:

    test_df['NEW COLUMN'] = test_df['PROPERTY ONE'].map(
        lambda json_list: [json_object["value"] for json_object in json_list]
    )