I have a DataFrame that looks like this:
╔══════╦══════╦══════════════════════════════════╦═══════════════════════╗
║ ID   ║ Name ║ Pets                             ║ CareCenter            ║
╠══════╬══════╬══════════════════════════════════╬═══════════════════════╣
║ 1    ║ John ║ [{'Name':'Alvin','Breed':'Dog'}, ║ {'Name':'PET SHOP 1'} ║
║      ║      ║  {'Name':'Rex','Type':'Cat'}]    ║                       ║
╠══════╬══════╬══════════════════════════════════╬═══════════════════════╣
║ 10   ║ Mary ║ {'Name':'Rose','Type':'Cat'}     ║ {'Name':'PET SHOP 2'} ║
╠══════╬══════╬══════════════════════════════════╬═══════════════════════╣
║ 1234 ║ Joe  ║ {'Name':'Max','Type':'Bird'}     ║ {'Name':'PET SHOP 3'} ║
╚══════╩══════╩══════════════════════════════════╩═══════════════════════╝
1) I iterate over each column to check its type.
2) If it's not an object (e.g. string, int, etc.), I leave it untouched.
3) If it's an object, then:
3.1) I take the first row and get the type of the value in each column.
3.2) If the column is an object and each value is a single dictionary (just one item), I build this df:
╔══════╦═══════════════════════╗
║ ID   ║ CareCenter            ║
╠══════╬═══════════════════════╣
║ 1    ║ {'Name':'PET SHOP 1'} ║
╠══════╬═══════════════════════╣
║ 10   ║ {'Name':'PET SHOP 2'} ║
╠══════╬═══════════════════════╣
║ 1234 ║ {'Name':'PET SHOP 3'} ║
╚══════╩═══════════════════════╝
I then apply json_normalize to the single-dict column:
╔══════╦═══════════════════════╦═════════════════╗
║ ID   ║ CareCenter            ║ CareCenter_Name ║
╠══════╬═══════════════════════╬═════════════════╣
║ 1    ║ {'Name':'PET SHOP 1'} ║ PET SHOP 1      ║
╠══════╬═══════════════════════╬═════════════════╣
║ 10   ║ {'Name':'PET SHOP 2'} ║ PET SHOP 2      ║
╠══════╬═══════════════════════╬═════════════════╣
║ 1234 ║ {'Name':'PET SHOP 3'} ║ PET SHOP 3      ║
╚══════╩═══════════════════════╩═════════════════╝
And then join it back to the original df on the ID column, naming the new column with the original column name concatenated with the key name, and dropping the original column:
╔══════╦══════╦══════════════════════════════════╦═════════════════╗
║ ID   ║ Name ║ Pets                             ║ CareCenter_Name ║
╠══════╬══════╬══════════════════════════════════╬═════════════════╣
║ 1    ║ John ║ [{'Name':'Alvin','Breed':'Dog'}, ║ PET SHOP 1      ║
║      ║      ║  {'Name':'Rex','Type':'Cat'}]    ║                 ║
╠══════╬══════╬══════════════════════════════════╬═════════════════╣
║ 10   ║ Mary ║ {'Name':'Rose','Type':'Cat'}     ║ PET SHOP 2      ║
╠══════╬══════╬══════════════════════════════════╬═════════════════╣
║ 1234 ║ Joe  ║ {'Name':'Max','Type':'Bird'}     ║ PET SHOP 3      ║
╚══════╩══════╩══════════════════════════════════╩═════════════════╝
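For reference, the single-dict step (3.2) can be sketched like this on a mock of the data above; it assumes pandas >= 1.0, where json_normalize is on the top-level namespace:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 10, 1234],
    "CareCenter": [{"Name": "PET SHOP 1"},
                   {"Name": "PET SHOP 2"},
                   {"Name": "PET SHOP 3"}],
})

# flatten the dicts into one column per key, prefixed with the source column
flat = pd.json_normalize(df["CareCenter"].tolist()).add_prefix("CareCenter_")
flat.index = df.index          # json_normalize returns a fresh index; realign
out = df.join(flat).drop(columns="CareCenter")
```

Realigning the index before the join matters if the original df has a non-default index (e.g. after dropna).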
3.3) If the column is a list of dictionaries, I create a new df like this, taking just the ID and the column in question:
╔══════╦══════════════════════════════════╗
║ ID   ║ Pets                             ║
╠══════╬══════════════════════════════════╣
║ 1    ║ [{'Name':'Alvin','Breed':'Dog'}, ║
║      ║  {'Name':'Rex','Type':'Cat'}]    ║
╠══════╬══════════════════════════════════╣
║ 10   ║ {'Name':'Rose','Type':'Cat'}     ║
╠══════╬══════════════════════════════════╣
║ 1234 ║ {'Name':'Max','Type':'Bird'}     ║
╚══════╩══════════════════════════════════╝
However I'm stuck on how to concatenate the values of each nested key into new columns. I'm looking for something like the table below; after that I can take it from there and join it back to the original df. The order of the concatenated strings is not important, as long as each value ends up in its correct column:
╔══════╦═══════════╦═══════════╗
║ ID   ║ Pets_Name ║ Pets_Type ║
╠══════╬═══════════╬═══════════╣
║ 1    ║ Alvin,Rex ║ Dog,Cat   ║
╠══════╬═══════════╬═══════════╣
║ 10   ║ Rose      ║ Cat       ║
╠══════╬═══════════╬═══════════╣
║ 1234 ║ Max       ║ Bird      ║
╚══════╩═══════════╩═══════════╝
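A minimal sketch of this list-of-dicts step, assuming pandas >= 0.25 for Series.explode and keys that are consistent across the dicts (the sample data actually mixes 'Breed' and 'Type', which would leave NaNs to handle before joining):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 10, 1234],
    "Pets": [
        [{"Name": "Alvin", "Type": "Dog"}, {"Name": "Rex", "Type": "Cat"}],
        [{"Name": "Rose", "Type": "Cat"}],
        [{"Name": "Max", "Type": "Bird"}],
    ],
})

# one dict per row -> one column per key -> comma-join values per original row
flat = df["Pets"].explode().apply(pd.Series)
agg = flat.groupby(level=0).agg(",".join).add_prefix("Pets_")
out = df.drop(columns="Pets").join(agg)
```

explode repeats the original row index for every list element, which is why groupby(level=0) reassembles the values per source row.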
Here is my code so far:
    import pandas as pd
    from pandas import json_normalize  # pandas.io.json on older versions

    for column in data_df:
        if data_df.dtypes[column] == "object":
            new = data_df[['Id', column]].copy()
            new = new.dropna(subset=[column])
            # check whether the value sampled from this column is a list
            a = (new.sample(1).applymap(type) == list).all()
            islist = False
            for i, v in a.items():
                if v and i == column:
                    islist = True
            if islist:
                for c in json_normalize(new[column].sample(1).iloc[0]):
                    # ****** STUCK ******
                    new = new.join(','.join({v for x in new[column] for y in x for k, v in y.items() if k == c}))
                    # ****** STUCK ******
            else:
                new = new.join(json_normalize(new[column]))
            new = new.drop(column, axis=1)
            new = new.add_prefix(column + '_')
            # data_df = data_df.join(new, on='Id', how='left')
            data_df = pd.merge(data_df, new, how='left', left_on='Id', right_on=column + '_Id')
            data_df = data_df.drop(column, axis=1)
            data_df = data_df.drop(column + '_Id', axis=1)
EDIT: This would be the final table I'm looking for:
╔══════╦══════╦═══════════╦═══════════╦═════════════════╗
║ ID   ║ Name ║ Pets_Name ║ Pets_Type ║ CareCenter_Name ║
╠══════╬══════╬═══════════╬═══════════╬═════════════════╣
║ 1    ║ John ║ Alvin,Rex ║ Dog,Cat   ║ PET SHOP 1      ║
╠══════╬══════╬═══════════╬═══════════╬═════════════════╣
║ 10   ║ Mary ║ Rose      ║ Cat       ║ PET SHOP 2      ║
╠══════╬══════╬═══════════╬═══════════╬═════════════════╣
║ 1234 ║ Joe  ║ Max       ║ Bird      ║ PET SHOP 3      ║
╚══════╩══════╩═══════════╩═══════════╩═════════════════╝
Thanks in advance for any pointers you could give me.
FINAL EDIT: This is my working code, with help from @BEN_YO:
    import pandas as pd
    from pandas import json_normalize  # pandas.io.json on older versions

    for column in data_df:
        # take the type of a sampled value rather than the dtype,
        # since dict and list columns all show up as dtype "object"
        mn = data_df.sample(1).applymap(type)
        mytype = mn[column].values[0]
        if mytype is dict or mytype is list:
            new = data_df[['Id', column]].copy()
            new = new.dropna(subset=[column])
            a = (new.sample(1).applymap(type) == list).all()
            islist = False
            for i, v in a.items():
                if v and i == column:
                    islist = True
            if islist:
                # list of dicts: explode to one dict per row, flatten each
                # key into a column, then comma-join values per original row
                for c in json_normalize(new[column].sample(1).iloc[0]):
                    new = new.join(new[column].explode().apply(pd.Series).groupby(level=0)[[c]].agg(','.join))
            else:
                # single dict: flatten its keys into new columns
                new = new.join(json_normalize(new[column]))
            new = new.drop(column, axis=1)
            new = new.add_prefix(column + '_')
            data_df = pd.merge(data_df, new, how='left', left_on='Id', right_on=column + '_Id')
            data_df = data_df.drop(column, axis=1)
            data_df = data_df.drop(column + '_Id', axis=1)
Try with explode:
out = df.join(df['Pets'].explode().apply(pd.Series).groupby(level=0)[['Name']].agg(','.join))
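Run end to end on data modeled on the question (column names and values here are illustrative), the one-liner above does this:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 10, 1234],
    "Pets": [
        [{"Name": "Alvin", "Type": "Dog"}, {"Name": "Rex", "Type": "Cat"}],
        [{"Name": "Rose", "Type": "Cat"}],
        [{"Name": "Max", "Type": "Bird"}],
    ],
})

# explode: one dict per row (original index repeated), apply(pd.Series):
# one column per key, groupby(level=0): reassemble per original row
out = df.join(
    df["Pets"].explode().apply(pd.Series)
      .groupby(level=0)[["Name"]].agg(",".join)
      .add_prefix("Pets_")
)
```

Repeating the same call with 'Type' (or any other key) in place of 'Name' covers the remaining columns, which is what the loop in the final code does.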