Search code examples
pythonhashmd5jsonb

How to sort jsonb column with keys ordered differently in python 3


I have two sample data frames df1 and df2 as shown below. A json object subjects column is created on selected columns and a hash column MD5 is created on subjects column on both df1 and df2.

df1= pd.DataFrame({
    "school_name":[ "ABC primary school","ABC primary school"],
    "class":[ "Year1", "Year2"],
    "id":[ "A001", "A002"],
    "name":[ "Tom" ,"James"],
    "maths":["",""],
     "physics": [66, 76],
     "chemistry":[61,75]
})

df1['subjects'] = df1.loc[:,~df1.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()

df1['hash']=df1['subjects'].apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())

The MD5 in hash columns looks like below:

df1
          school_name  class    id   name maths  physics  chemistry  \
0  ABC primary school  Year1  A001    Tom             66         61   
1  ABC primary school  Year2  A002  James             76         75   
                                                                  subjects  \
0    {"class":"Year1","name":"Tom","maths":"","physics":66,"chemistry":61}   
1  {"class":"Year2","name":"James","maths":"","physics":76,"chemistry":75}   
                               hash  
0  10e79cbff4602a2cb36b5c0ab26ab8cd  
1  19234a5bbe7499beec35e1a4410f1c80 

Now a second dataframe with different order of columns:

df2= pd.DataFrame({
    "school_name":[ "ABC primary school","ABC primary school"],
    "class":[ "Year1", "Year2"],
    "chemistry":[61,75],
    "id":[ "A001", "A002"],
    "name":[ "Tom" ,"James"],
    "maths":["",""],
    "physics": [66, 76]
     
})

df2['subjects'] = df2.loc[:,~df2.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()

df2['hash']=df2['subjects'].apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())

The MD5 in hash column for df2 looks like below:

df2
          school_name  class  chemistry    id   name maths  physics  \
0  ABC primary school  Year1         61  A001    Tom             66   
1  ABC primary school  Year2         75  A002  James             76   
                                                                  subjects  \
0    {"class":"Year1","chemistry":61,"name":"Tom","maths":"","physics":66}   
1  {"class":"Year2","chemistry":75,"name":"James","maths":"","physics":76}   
                               hash  
0  655ebff5003307e0f3d66cf7752a9ad5  
1  648ff416b7c7f3e71dc8e68022da56a6 

Here only the order of jsonb subjects differs in both dataframes due to which it gives a different MD5.

How can we generate, same MD5 signature in both data frames on the column subjects.

I tried to sort before creating hash, however this is not sorting as expected.

df1['subjects'] = df1['subjects'].apply(lambda x: json.dumps(x, sort_keys=True))
df2['subjects'] = df2['subjects'].apply(lambda x: json.dumps(x, sort_keys=True))

How could these be sorted and create same hash?


Solution

  • After converting string type dictionaries to original dictionary (json.loads), sort them by key.

    df1['subjects'] = df1['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
    
    df2['subjects'] = df2['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
    
    

    Use the astype(str) function to avoid errors while generating the hash code.

    df1['hash']=df1['subjects'].astype(str).apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
    

    Full code:

    import json
    df1= pd.DataFrame({
        "school_name":[ "ABC primary school","ABC primary school"],
        "class":[ "Year1", "Year2"],
        "id":[ "A001", "A002"],
        "name":[ "Tom" ,"James"],
        "maths":["",""],
         "physics": [66, 76],
         "chemistry":[61,75]
    })
    
    df1['subjects'] = df1.loc[:,~df1.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()
    df1['subjects'] = df1['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
    
    df1['hash']=df1['subjects'].astype(str).apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
    
    
    df2= pd.DataFrame({
        "school_name":[ "ABC primary school","ABC primary school"],
        "class":[ "Year1", "Year2"],
        "chemistry":[61,75],
        "id":[ "A001", "A002"],
        "name":[ "Tom" ,"James"],
        "maths":["",""],
        "physics": [66, 76]
         
    })
    
    df2['subjects'] = df2.loc[:,~df2.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()
    df2['subjects'] = df2['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
    df2['hash']=df2['subjects'].astype(str).apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
    
    print(df1)
    '''
    |    | school_name        | class   | id   | name   | maths   |   physics |   chemistry | subjects                                                                         | hash                             |
    |---:|:-------------------|:--------|:-----|:-------|:--------|----------:|------------:|:---------------------------------------------------------------------------------|:---------------------------------|
    |  0 | ABC primary school | Year1   | A001 | Tom    |         |        66 |          61 | {'chemistry': 61, 'class': 'Year1', 'maths': '', 'name': 'Tom', 'physics': 66}   | 81f4e76d6c230729e8e2dd8577e018e9 |
    |  1 | ABC primary school | Year2   | A002 | James  |         |        76 |          75 | {'chemistry': 75, 'class': 'Year2', 'maths': '', 'name': 'James', 'physics': 76} | a5a515d947e09c9578d86e441929dd1d |
    '''
    
    print(df2)
    '''
    |    | school_name        | class   |   chemistry | id   | name   | maths   |   physics | subjects                                                                         | hash                             |
    |---:|:-------------------|:--------|------------:|:-----|:-------|:--------|----------:|:---------------------------------------------------------------------------------|:---------------------------------|
    |  0 | ABC primary school | Year1   |          61 | A001 | Tom    |         |        66 | {'chemistry': 61, 'class': 'Year1', 'maths': '', 'name': 'Tom', 'physics': 66}   | 81f4e76d6c230729e8e2dd8577e018e9 |
    |  1 | ABC primary school | Year2   |          75 | A002 | James  |         |        76 | {'chemistry': 75, 'class': 'Year2', 'maths': '', 'name': 'James', 'physics': 76} | a5a515d947e09c9578d86e441929dd1d |
    '''
    ```