this is my first question, so I hope I provide all the necessary details.
I'm fairly new to handling data with python, and I fight with the conversion of a specific type of dataframe. I'm accessing data from a relational database (Baserow) via an API. The output includes (among other data types) relation data of dictionaries within lists, like so:
[{'id': 1207327, 'value': 'foo'}, {'id': 1207330, 'value': 'bar'}]
In the respective columns, some cells are filled like this, others contain empty lists like so: []
.
So, the dataframe looks like this (only much bigger):
test_df = pd.DataFrame({'ID' : [1,2],
'PROPERTY ONE': [
[{'id': 1207327, 'value': 'foo'}, {'id': 1207330, 'value': 'bar'}],
[]
]})
Output:
ID PROPERTY ONE
0 1 [{'id': 1207327, 'value': 'foo'}, {'id': 12073...
1 2 []
I am looking for a way to unpack this efficiently with pandas into fields containing plain lists or strings of value
with some separator, e.g.
`[foo, bar]`
I have build a solution with two for
-loops, but this looks inefficient and clumsy. Is there any way to achieve this in a vectorized way?
My for
-loop-solution looks like this:
The second loop looks like this:
new_column = []
for array in column:
tmp_dfr = pd.DataFrame.from_dict(array)
length = tmp_dfr.empty
if not tmp_dfr.empty:
values = tmp_dfr['value'].tolist()
else:
values = []
new_column.append(values)
new_column = pd.Series(new_column)
As described above, I am not happy with the way this is achieved, as it is faster and more efficient to work in a vectorized way, as I understand.
I have also looked into several seemingly related questions on this site (e.g. [here](pandas DataFrame: normalize one JSON column and merge with other columns or here). But as I am at the moment just fumbling my way into pandas, I am not able to transfer the solutions to my problem. Instead, I am running into all kinds of exceptions, maybe because some of the cells contain empty lists...
Creating many extra dataframes is certainly inefficient because it creates extra overhead. You can just use a list comprehension instead (the simplify
function). That way there's also no need to special-case the empty list. I would use pandas.Series.map
for converting the elements. Using .map
instead of some form of iteration makes clear that the operation doesn't depend on results from higher up in the table.
def simplify(json_list):
return [json_object["value"] for json_object in json_list]
test_df['NEW COLUMN'] = test_df['PROPERTY ONE'].map(simplify)
The output is:
ID PROPERTY ONE NEW COLUMN
0 1 [{'id': 1207327, 'value': 'foo'}, {'id': 12073... [foo, bar]
1 2 [] []
If you're a fan of brevity, you can use a lambda function instead of the named function:
test_df['NEW COLUMN'] = test_df['PROPERTY ONE'].map(
lambda json_list: [json_object["value"] for json_object in json_list]
)