I have a data frame which is as follows.
Table_name | Domain | Use_case |
---|---|---|
group | Reporting | finance, marketing |
resource | Banking, Reporting | marketing, sales, finance |
customer | Reporting | sales |
Domain_df[]:
Domain_name | Domain_id |
---|---|
Reporting | 1001 |
Banking | 1002 |
Usecase_df[]
usecase_name | id |
---|---|
Finance | 2001 |
marketing | 2002 |
sales | 2003 |
Final result I want is:
Table_name | Domain | Use_case | domain_id | usecase_id |
---|---|---|---|---|
group | Reporting | finance, marketing | [1001] | [2001,2002] |
resource | Banking,Reporting | marketing, sales, finance | [1002,1001] | [2002,2003,2001] |
customer | Reporting | sales | [1002] | [2003] |
Intitially I tried splitting the column with delimiter , and forming separate columns using:
df=df.join(pandas.DataFrame(df['Verified Use case'].str.split(', ', expand=True)).add_prefix('use_case_'))
Result of which I got something like:
Table_name | Domain | Use_case | domain0 | domain1 | usecase0 | usecase1 | usecase2 |
---|---|---|---|---|---|---|---|
group | Reporting | finance, marketing | Reporting | NA | finance | marketing | NA |
resource | Banking, Reporting | marketing, sales, finance | Banking | Reporting | marketing | sales | finance |
customer | Reporting | sales | reporting | NA | sales | NA | NA |
Beyond this, I need to do a merge/lookup on ever column to the domain and use case tables to fetch the respective ids returned to the data frame in the form of a list in the column. It should be a list of the corresponding ids.
Is it a good approach to split every column into separate dynamic columns and then do a merge (in which case writing this merge statement is tricky for which I need some help)
but my question was is there any other way to achieve this directly, without splitting it?
If no, then how can we achieve dynamic merging?
PS: The no of values being entered in the columns domain and use_case is a variable.
This is a possible solution:
domain_map = domain_df.set_index('Domain_name')['Domain_id'].to_dict()
usecase_map = usecase_df.set_index('usecase_name')['id'].to_dict()
df = df.assign(
domain_id=(
df['Domain'].str.split(', ')
.map(lambda lst: list(map(domain_map.get, lst)))),
usecase_id=(
df['Use_case'].str.split(', ')
.map(lambda lst: list(map(usecase_map.get, lst)))),
)