Search code examples
pythonjsonsql-serverpandasflatten

pandas dataframe breaking long table to multiple tables


I have a dataframe which has a structure like below. id, name, value, employee_0_salary, employee_0_skills_0_id, employee_0_skills_1_id, employee_1_salary, employee_1_skills_0_id so on..

Basically formed from a flattened json. But the number of columns exceed 1024. I couldn't copy the structure of json here but its a nested json with lists, arrays and dictionaries.

Now I want to split the dataframe so that the id column has duplicates but empoyee details comes as columns.

id salary skills_id

How to do it in pandas?

TIA


Solution

  • maybe melt() does what you want?

    df = pd.DataFrame({
        'id': [1, 2, 1],
        'name': ['joe','sue', 'fred'],
        'employee_0_salary': [30000, 35000, 40000],
        'employee_0_skills_0_id': [101, 102, 103],
        'employee_0_skills_1_id': [103, 104, 105],
        'employee_1_salary': [32000, 36000, 37000],
        'employee_1_skills_0_id': [105, 106, 107],
        # Add other columns...
    })
        
        print(df)
       id  name  employee_0_salary  employee_0_skills_0_id  \
    0   1   joe              30000                     101   
    1   2   sue              35000                     102   
    2   1  fred              40000                     103   
    
       employee_0_skills_1_id  employee_1_salary  employee_1_skills_0_id  
    0                     103              32000                     105  
    1                     104              36000                     106  
    2                     105              37000                     107  
        
        print(df.melt(id_vars=['id','name']))
        id  name                variable  value
    0    1   joe       employee_0_salary  30000
    1    2   sue       employee_0_salary  35000
    2    1  fred       employee_0_salary  40000
    3    1   joe  employee_0_skills_0_id    101
    4    2   sue  employee_0_skills_0_id    102
    5    1  fred  employee_0_skills_0_id    103
    6    1   joe  employee_0_skills_1_id    103
    7    2   sue  employee_0_skills_1_id    104
    8    1  fred  employee_0_skills_1_id    105
    9    1   joe       employee_1_salary  32000
    10   2   sue       employee_1_salary  36000
    11   1  fred       employee_1_salary  37000
    12   1   joe  employee_1_skills_0_id    105
    13   2   sue  employee_1_skills_0_id    106
    14   1  fred  employee_1_skills_0_id    107
    

    then, to return to the original it is messy because of the duplicate in the id column. This is one way which adds a level to the row index so the pivot works, then drops it:

    print(meltdf.reset_index().pivot(index=['index','id','name'], columns='variable', values='value') \
                        .droplevel(level=0) \
                        .groupby(level=0) \
                        .transform(lambda x: sorted(x, key=lambda k: pd.isna(k))) \
                        .dropna(axis=0))
    
    variable  employee_0_salary  employee_0_skills_0_id  employee_0_skills_1_id  \
    id name                                                                       
    1  joe              30000.0                   101.0                   103.0   
    2  sue              35000.0                   102.0                   104.0   
    1  fred             40000.0                   103.0                   105.0   
    
    variable  employee_1_salary  employee_1_skills_0_id  
    id name                                              
    1  joe              32000.0                   105.0  
    2  sue              36000.0                   106.0  
    1  fred             37000.0                   107.0