Search code examples
pythonpython-3.xpandasdataframefeature-extraction

Use the column of a dataframe that has a list of dictionaries to create other columns for the dataframe


I have a column in my dataframe of type object that has values like:

for i in df3['placeholders'][:10]:

Output:
[{'type': 'experience', 'label': '0-1 Yrs'}, {'type': 'salary', 'label': '1,00,000 - 1,25,000 PA.'}, {'type': 'location', 'label': 'Chennai'}]
[{'type': 'date', 'label': '08 October - 13 October'}, {'type': 'salary', 'label': 'Not disclosed'}, {'type': 'location', 'label': 'Chennai'}]
[{'type': 'education', 'label': 'B.Com'}, {'type': 'salary', 'label': 'Not disclosed'}, {'type': 'location', 'label': 'Mumbai Suburbs, Navi Mumbai, Mumbai'}]
[{'type': 'experience', 'label': '0-2 Yrs'}, {'type': 'salary', 'label': '50,000 - 2,00,000 PA.'}, {'type': 'location', 'label': 'Chennai'}]
[{'type': 'experience', 'label': '0-1 Yrs'}, {'type': 'salary', 'label': '2,00,000 - 2,25,000 PA.'}, {'type': 'location', 'label': 'Bengaluru(JP Nagar)'}]
[{'type': 'experience', 'label': '0-3 Yrs'}, {'type': 'salary', 'label': '80,000 - 2,00,000 PA.'}, {'type': 'location', 'label': 'Hyderabad'}]
[{'type': 'experience', 'label': '0-5 Yrs'}, {'type': 'salary', 'label': 'Not disclosed'}, {'type': 'location', 'label': 'Hyderabad'}]
[{'type': 'experience', 'label': '0-1 Yrs'}, {'type': 'salary', 'label': '1,25,000 - 2,00,000 PA.'}, {'type': 'location', 'label': 'Mumbai'}]
[{'type': 'date', 'label': '08 October - 17 October'}, {'type': 'salary', 'label': 'Not disclosed'}, {'type': 'location', 'label': 'Pune(Bavdhan)'}]
[{'type': 'experience', 'label': '0-2 Yrs'}, {'type': 'salary', 'label': 'Not disclosed'}, {'type': 'location', 'label': 'Jaipur'}]
[{'type': 'experience', 'label': '0-0 Yrs'}, {'type': 'salary', 'label': '1,00,000 - 1,50,000 PA.'}, {'type': 'location', 'label': 'Delhi NCR(Sector-81 Noida)'}]

I want to add more columns to my existing dataframe by extracting features from this column such that

value of "type"= Column name

value of "label"= value under the column

The final expected output:

df.head(3)

Output:

..... experience, salary, location, date, education

..... 0-1 Yrs, 1,00,000 - 1,25,000 PA., Chennai, nan, nan
..... nan, 1,00,000 - 1,25,000 PA., Chennai, 08 October - 13 October, nan
..... nan, Not disclosed, Mumbai Suburbs, Navi Mumbai, Mumbai, nan, B.Com

The first answer worked. [EDIT 2]

Later, I tried the same code suggested in the first response for a new dataset with same issue. I got the following error:

<ipython-input-23-ad8e644044af> in <listcomp>(.0)
----> 1 new_columns = set([d['Name'] for l in dfr.RatingDistribution.values for d in l ])
      2 # Make a dict of dicts
      3 col_val_dict = {}
      4 for col_name in new_columns:
      5     col_val_dict[col_name] = {}

TypeError: 'float' object is not iterable

My Input column:

RatingDistribution
[{'Name': 'Work-Life Balance', 'count': 5}, {'Name': 'Skill Development', 'count': 5}, {'Name': 'Salary & Benefits', 'count': 5}, {'Name': 'Job Security', 'count': 5}, {'Name': 'Company Culture', 'count': 5}, {'Name': 'Career Growth', 'count': 5}, {'Name': 'Work Satisfaction', 'count': 5}]
[{'Name': 'Work-Life Balance', 'count': 4}, {'Name': 'Skill Development', 'count': 5}, {'Name': 'Salary & Benefits', 'count': 4}, {'Name': 'Job Security', 'count': 4}, {'Name': 'Company Culture', 'count': 3}, {'Name': 'Career Growth', 'count': 3}, {'Name': 'Work Satisfaction', 'count': 5}]
[{'Name': 'Work-Life Balance', 'count': 3}, {'Name': 'Skill Development', 'count': 4}, {'Name': 'Salary & Benefits', 'count': 5}, {'Name': 'Job Security', 'count': 4}, {'Name': 'Company Culture', 'count': 5}, {'Name': 'Career Growth', 'count': 4}, {'Name': 'Work Satisfaction', 'count': 4}]
[{'Name': 'Work-Life Balance', 'count': 5}, {'Name': 'Skill Development', 'count': 5}, {'Name': 'Salary & Benefits', 'count': 5}, {'Name': 'Job Security', 'count': 5}, {'Name': 'Company Culture', 'count': 5}, {'Name': 'Career Growth', 'count': 5}, {'Name': 'Work Satisfaction', 'count': 5}]
[{'Name': 'Work-Life Balance', 'count': 3}, {'Name': 'Skill Development', 'count': 5}, {'Name': 'Salary & Benefits', 'count': 3}, {'Name': 'Job Security', 'count': 3}, {'Name': 'Company Culture', 'count': 3}, {'Name': 'Career Growth', 'count': 3}, {'Name': 'Work Satisfaction', 'count': 4}]
[{'Name': 'Work-Life Balance', 'count': 3}, {'Name': 'Skill Development', 'count': 5}, {'Name': 'Salary & Benefits', 'count': 5}, {'Name': 'Job Security', 'count': 1}, {'Name': 'Company Culture', 'count': 3}, {'Name': 'Career Growth', 'count': 1}, {'Name': 'Work Satisfaction', 'count': 1}]

My code:

new_columns = set([d['Name'] for l in dfr.RatingDistribution.values for d in l ])
# Make a dict of dicts 
col_val_dict = {}
for col_name in new_columns:
    col_val_dict[col_name] = {}
    # For each column name look to see if a row has that as a type
    # If so, get the label for that dict
    # otherwise fill it with NaN
    for i,l in enumerate(dfr.placeholders.values):
        the_label = [d['count'] for d in l if d['Name'] == col_name]
        if the_label:
            col_val_dict[col_name][i] = the_label[0]
        else:
            col_val_dict[col_name][i] = np.NaN
            
# Merge this new dfa with the old one
merged_dfa = pd.concat([dfr,pd.DataFrame(col_val_dict)],axis='columns')
dfr.shape

I'm getting error in the very first line. I'm not able to figure out why it is throwing me the float error.

PLEASE HELP


Solution

  • # Get the unique types (column names)
    new_columns = set([d['type'] for l in df3.placeholders.values for d in l ])
    # Make a dict of dicts 
    col_val_dict = {}
    for col_name in new_columns:
        col_val_dict[col_name] = {}
        # For each column name look to see if a row has that as a type
        # If so, get the label for that dict
        # otherwise fill it with NaN
        for i,l in enumerate(df3.placeholders.values):
            the_label = [d['label'] for d in l if d['type'] == col_name]
            if the_label:
                col_val_dict[col_name][i] = the_label[0]
            else:
                col_val_dict[col_name][i] = np.NaN
                
    # Merge this new df with the old one
    merged_df = pd.concat([df3,pd.DataFrame(col_val_dict)],axis='columns')