I have two sample data frames df1
and df2
as shown below. A json object subjects
column is created on selected columns and a hash
column MD5
is created on subjects
column on both df1
and df2
.
df1= pd.DataFrame({
"school_name":[ "ABC primary school","ABC primary school"],
"class":[ "Year1", "Year2"],
"id":[ "A001", "A002"],
"name":[ "Tom" ,"James"],
"maths":["",""],
"physics": [66, 76],
"chemistry":[61,75]
})
df1['subjects'] = df1.loc[:,~df1.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()
df1['hash']=df1['subjects'].apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
The MD5
in hash
columns looks like below:
df1
school_name class id name maths physics chemistry \
0 ABC primary school Year1 A001 Tom 66 61
1 ABC primary school Year2 A002 James 76 75
subjects \
0 {"class":"Year1","name":"Tom","maths":"","physics":66,"chemistry":61}
1 {"class":"Year2","name":"James","maths":"","physics":76,"chemistry":75}
hash
0 10e79cbff4602a2cb36b5c0ab26ab8cd
1 19234a5bbe7499beec35e1a4410f1c80
Now a second dataframe with different order of columns:
df2= pd.DataFrame({
"school_name":[ "ABC primary school","ABC primary school"],
"class":[ "Year1", "Year2"],
"chemistry":[61,75],
"id":[ "A001", "A002"],
"name":[ "Tom" ,"James"],
"maths":["",""],
"physics": [66, 76]
})
df2['subjects'] = df2.loc[:,~df2.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()
df2['hash']=df2['subjects'].apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
The MD5
in hash
column for df2 looks like below:
df2
school_name class chemistry id name maths physics \
0 ABC primary school Year1 61 A001 Tom 66
1 ABC primary school Year2 75 A002 James 76
subjects \
0 {"class":"Year1","chemistry":61,"name":"Tom","maths":"","physics":66}
1 {"class":"Year2","chemistry":75,"name":"James","maths":"","physics":76}
hash
0 655ebff5003307e0f3d66cf7752a9ad5
1 648ff416b7c7f3e71dc8e68022da56a6
Here only the order of jsonb subjects
differs in both dataframes due to which it gives a different MD5
.
How can we generate, same MD5
signature in both data frames on the column subjects
.
I tried to sort before creating hash, however this is not sorting as expected.
df1['subjects'] = df1['subjects'].apply(lambda x: json.dumps(x, sort_keys=True))
df2['subjects'] = df2['subjects'].apply(lambda x: json.dumps(x, sort_keys=True))
How could these be sorted and create same hash?
After converting string type dictionaries to original dictionary (json.loads), sort them by key.
df1['subjects'] = df1['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
df2['subjects'] = df2['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
Use the astype(str) function to avoid errors while generating the hash code.
df1['hash']=df1['subjects'].astype(str).apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
Full code:
import json
df1= pd.DataFrame({
"school_name":[ "ABC primary school","ABC primary school"],
"class":[ "Year1", "Year2"],
"id":[ "A001", "A002"],
"name":[ "Tom" ,"James"],
"maths":["",""],
"physics": [66, 76],
"chemistry":[61,75]
})
df1['subjects'] = df1.loc[:,~df1.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()
df1['subjects'] = df1['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
df1['hash']=df1['subjects'].astype(str).apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
df2= pd.DataFrame({
"school_name":[ "ABC primary school","ABC primary school"],
"class":[ "Year1", "Year2"],
"chemistry":[61,75],
"id":[ "A001", "A002"],
"name":[ "Tom" ,"James"],
"maths":["",""],
"physics": [66, 76]
})
df2['subjects'] = df2.loc[:,~df2.columns.isin(['id','school_name'])].to_json(orient='records', lines=True, date_format='iso').splitlines()
df2['subjects'] = df2['subjects'].apply(lambda x: dict(sorted(json.loads(x).items())))
df2['hash']=df2['subjects'].astype(str).apply(lambda x: hashlib.md5(x.encode("utf-8")).hexdigest())
print(df1)
'''
| | school_name | class | id | name | maths | physics | chemistry | subjects | hash |
|---:|:-------------------|:--------|:-----|:-------|:--------|----------:|------------:|:---------------------------------------------------------------------------------|:---------------------------------|
| 0 | ABC primary school | Year1 | A001 | Tom | | 66 | 61 | {'chemistry': 61, 'class': 'Year1', 'maths': '', 'name': 'Tom', 'physics': 66} | 81f4e76d6c230729e8e2dd8577e018e9 |
| 1 | ABC primary school | Year2 | A002 | James | | 76 | 75 | {'chemistry': 75, 'class': 'Year2', 'maths': '', 'name': 'James', 'physics': 76} | a5a515d947e09c9578d86e441929dd1d |
'''
print(df2)
'''
| | school_name | class | chemistry | id | name | maths | physics | subjects | hash |
|---:|:-------------------|:--------|------------:|:-----|:-------|:--------|----------:|:---------------------------------------------------------------------------------|:---------------------------------|
| 0 | ABC primary school | Year1 | 61 | A001 | Tom | | 66 | {'chemistry': 61, 'class': 'Year1', 'maths': '', 'name': 'Tom', 'physics': 66} | 81f4e76d6c230729e8e2dd8577e018e9 |
| 1 | ABC primary school | Year2 | 75 | A002 | James | | 76 | {'chemistry': 75, 'class': 'Year2', 'maths': '', 'name': 'James', 'physics': 76} | a5a515d947e09c9578d86e441929dd1d |
'''
```