Search code examples
pythonpandasdataframemergelookup

Pandas splitting of columns (dtype:list) followed by dynamic lookup/merge


I have a data frame which is as follows.

Table_name Domain Use_case
group Reporting finance, marketing
resource Banking, Reporting marketing, sales, finance
customer Reporting sales

Domain_df[]:

Domain_name Domain_id
Reporting 1001
Banking 1002

Usecase_df[]

usecase_name id
Finance 2001
marketing 2002
sales 2003

Final result I want is:

Table_name Domain Use_case domain_id usecase_id
group Reporting finance, marketing [1001] [2001,2002]
resource Banking,Reporting marketing, sales, finance [1002,1001] [2002,2003,2001]
customer Reporting sales [1002] [2003]

Intitially I tried splitting the column with delimiter , and forming separate columns using:

df=df.join(pandas.DataFrame(df['Verified Use case'].str.split(', ', expand=True)).add_prefix('use_case_'))

Result of which I got something like:

Table_name Domain Use_case domain0 domain1 usecase0 usecase1 usecase2
group Reporting finance, marketing Reporting NA finance marketing NA
resource Banking, Reporting marketing, sales, finance Banking Reporting marketing sales finance
customer Reporting sales reporting NA sales NA NA

Beyond this, I need to do a merge/lookup on ever column to the domain and use case tables to fetch the respective ids returned to the data frame in the form of a list in the column. It should be a list of the corresponding ids.

Is it a good approach to split every column into separate dynamic columns and then do a merge (in which case writing this merge statement is tricky for which I need some help)

but my question was is there any other way to achieve this directly, without splitting it?

If no, then how can we achieve dynamic merging?

PS: The no of values being entered in the columns domain and use_case is a variable.


Solution

  • This is a possible solution:

    domain_map = domain_df.set_index('Domain_name')['Domain_id'].to_dict()
    usecase_map = usecase_df.set_index('usecase_name')['id'].to_dict()
    
    df = df.assign(
        domain_id=(
            df['Domain'].str.split(', ')
                        .map(lambda lst: list(map(domain_map.get, lst)))),
        usecase_id=(
            df['Use_case'].str.split(', ')
                          .map(lambda lst: list(map(usecase_map.get, lst)))),
    )