Take the example:
pd.DataFrame(
index=["r1", "r2"],
columns=["c1","c2","c3", "group_by"],
data=[
["v1",[{"x_title":"xt1","x_label":"xl1","y_title":"yt1","y_label":"yl1"},
{"x_title":"xt1","x_label":"xl1","y_title":"yt2","y_label":"yl2"},
{"x_title":"xt2","x_label":"xl2","y_title":"yt3","y_label":"yl3"}],
"v3","x"],
["v1",[{"x_title":"xt1","x_label":"xl1","y_title":"yt1","y_label":"yl1"},
{"x_title":"xt2","x_label":"xl2","y_title":"yt2","y_label":"yl2"},
{"x_title":"xt3","x_label":"xl3","y_title":"yt3","y_label":"yl3"}],
"v3","y"],
]
)
Which produces a dataframe like:
c1 c2 c3 group_by
r1 v1 [{"x_title": "xt1", "x_label": "xl1", "y_title... v3 x
r2 v1 [{"x_title": "xt1", "x_label": "xl1", "y_title... v3 y
The main point of conversation will be c2, which is a list of dictionaries. I would like to "pull out" each unique value for the k,v pair where k ~= f"{parent.group_by}_title". So if group_by
on the row == x
, each unique x_title
will have a "child" row generated that keeps all the remaining values.
The end dataframe will hopefully look similar to:
pd.DataFrame(
index=["r1","r2","r1","r1","r2","r2","r2"],
columns=["group","orig_name","c2","c3", "type"],
data=[
[None,"v1",None,"v3","parent"],
[None,"v1",None,"v3","parent"],
["xt1","v1",[{"y_title":"yt1","y_label":"yl1"},{"y_title":"yt2","y_label":"yl2"}],"v3","child"],
["xt2","v1",[{"y_title":"yt3","y_label":"yl3"}],"v3","child"],
["yt1","v1",[{"x_title":"xt1","x_label":"xl1"}],"v3","child"],
["yt1","v1",[{"x_title":"xt2","x_label":"xl2"}],"v3","child"],
["yt1","v1",[{"x_title":"xt3","x_label":"xl3"}],"v3","child"],
]
)
or like:
group orig_name c2 c3 type
r1 None v1 None v3 r1 parent
r2 None v1 None v3 r2 parent
r1 xt1 v1 [{"y_title": "yt1", "y_label": "yl1"}, {"y_tit... v3 r1 child
r1 xt2 v1 [{"y_title": "yt3", "y_label": "yl3"}] v3 r1 child
r2 yt1 v1 [{"x_title": "xt1", "x_label": "xl1"}] v3 r2 child
r2 yt1 v1 [{"x_title": "xt2", "x_label": "xl2"}] v3 r2 child
r2 yt1 v1 [{"x_title": "xt3", "x_label": "xl3"}] v3 r2 child
I'm able to do this "manually" but iterating through each row and going from there. But I would like to have a more "pandas" answer if possible. I've dabbled in ~apply(lambda ~
, explode
, using a def children(row)
function put into a ~apply(lambda ~
, and a few other methods. They are didn't quite align to what I'm interested in, and none are quite close enough to provide details.
I'm sure (as I always do) I'm somehow overthinking/looking a simple solution, just not aware of a function that will help, or just no using one of my tries correctly.
Hoping someone could explain a better way of doing this? Thank you!
There may be a more direct way of approaching this but to me it looks like you could:
c2
into rows/columnsc2
columnStep 1:
You can .explode("c2")
then pd.json_normalize
df = df.explode("c2")
c2 = pd.json_normalize(df["c2"]).set_index(df.index)
df[c2.columns] = c2
df["c2"] = None
I've assigned c2
to None
here just to simplify the output.
c1 c2 c3 group_by x_title x_label y_title y_label
r1 v1 None v3 x xt1 xl1 yt1 yl1
r1 v1 None v3 x xt1 xl1 yt2 yl2
r1 v1 None v3 x xt2 xl2 yt3 yl3
r2 v1 None v3 y xt1 xl1 yt1 yl1
r2 v1 None v3 y xt2 xl2 yt2 yl2
r2 v1 None v3 y xt3 xl3 yt3 yl3
Step 2:
.columns.get_indexer
can map column names to indexes.
>>> df["group_by"] + "_title"
r1 x_title
r1 x_title
r1 x_title
r2 y_title
r2 y_title
r2 y_title
Name: group_by, dtype: object
>>> df.columns.get_indexer(df["group_by"] + "_title")
array([4, 4, 4, 6, 6, 6])
The indexes allow you to extract from a different column on a per-row basis: How can I select a specific column from each row in a Pandas DataFrame?
This can be used to create group/key/value columns:
df["key_name"] = df["group_by"].map({"x": "y", "y": "x"}) + "_title"
df["value_name"] = df["group_by"].map({"x": "y", "y": "x"}) + "_label"
rows = range(len(df.index))
df["group"] = df.values[rows, df.columns.get_indexer(df["group_by"] + "_title")]
df["key"] = df.values[rows, df.columns.get_indexer(df["key_name"])]
df["value"] = df.values[rows, df.columns.get_indexer(df["value_name"])]
c1 c2 c3 group_by x_title x_label y_title y_label key_name value_name group key value
r1 v1 None v3 x xt1 xl1 yt1 yl1 y_title y_label xt1 yt1 yl1
r1 v1 None v3 x xt1 xl1 yt2 yl2 y_title y_label xt1 yt2 yl2
r1 v1 None v3 x xt2 xl2 yt3 yl3 y_title y_label xt2 yt3 yl3
r2 v1 None v3 y xt1 xl1 yt1 yl1 x_title x_label yt1 xt1 xl1
r2 v1 None v3 y xt2 xl2 yt2 yl2 x_title x_label yt2 xt2 xl2
r2 v1 None v3 y xt3 xl3 yt3 yl3 x_title x_label yt3 xt3 xl3
Step 3:
The parent rows:
>>> df.groupby(df.index).first()
c1 c2 c3 group_by x_title x_label y_title y_label key_name value_name group key value
r1 v1 None v3 x xt1 xl1 yt1 yl1 y_title y_label xt1 yt1 yl1
r2 v1 None v3 y xt1 xl1 yt1 yl1 x_title x_label yt1 xt1 xl1
The child rows:
I think there may be a smarter way to turn the columns back into dicts - but for now:
df["c2"] = [
{ row[f"{col}_name"]: row[col] for col in ("key", "value") }
for row in df[["key_name", "value_name", "key", "value"]].reset_index(drop=True).to_dict("records")
]
>>> df["c2"]
r1 {'y_title': 'yt1', 'y_label': 'yl1'}
r1 {'y_title': 'yt2', 'y_label': 'yl2'}
r1 {'y_title': 'yt3', 'y_label': 'yl3'}
r2 {'x_title': 'xt1', 'x_label': 'xl1'}
r2 {'x_title': 'xt2', 'x_label': 'xl2'}
r2 {'x_title': 'xt3', 'x_label': 'xl3'}
You can then .groupby("group")
and .agg()
For .agg()
you want to use list
for the c2
column and first
for every other column.
One way to write this is to map all column names to the default action you want and merge a dict of "overrides":
>>> columns = "a", "b", "c"
>>> dict.fromkeys(columns, "first")
{'a': 'first', 'b': 'first', 'c': 'first'}
>>> dict.fromkeys(columns, "first") | dict(c="list")
{'a': 'first', 'b': 'first', 'c': 'list'}
On the dataframe:
>>> df.reset_index().groupby("group").agg(dict.fromkeys(df.columns, "first") | dict(c2=list))
c1 c2 c3 group_by x_title x_label y_title y_label group key_name value_name key value
group
xt1 v1 [{'y_title': 'yt1', 'y_label': 'yl1'}, {'y_tit... v3 x xt1 xl1 yt1 yl1 xt1 y_title y_label yt1 yl1
xt2 v1 [{'y_title': 'yt3', 'y_label': 'yl3'}] v3 x xt2 xl2 yt3 yl3 xt2 y_title y_label yt3 yl3
yt1 v1 [{'x_title': 'xt1', 'x_label': 'xl1'}] v3 y xt1 xl1 yt1 yl1 yt1 x_title x_label xt1 xl1
yt2 v1 [{'x_title': 'xt2', 'x_label': 'xl2'}] v3 y xt2 xl2 yt2 yl2 yt2 x_title x_label xt2 xl2
yt3 v1 [{'x_title': 'xt3', 'x_label': 'xl3'}] v3 y xt3 xl3 yt3 yl3 yt3 x_title x_label xt3 xl3
You could then remove any columns that are no longer needed - add a parent/child identifer column to the parent/child rows and .concat
them.