Tags: python, pandas, numpy, dictionary, logic

Categorize a column using a dictionary that maps each key to multiple values


I have a dictionary:

{'Consulting': {'Deloitte', 'EY', 'KPMG', 'PwC'},
'Education': {'.edu', 'College', 'University'},
'Government':{'state','.gov','city'},
'Corporate':{'corpor','consumer','care'},
...... etc.}

I have a dataframe:

 Sno  Text            column1    column2 ......
  1   Deloitte.com
  2   Texas.gov
  3   smi@EY.com
  4   UTD.edu
  5   rapper@corporate.com

 ..... etc.

I want to use the dictionary to categorize the dataframe and build a column Category, like this:

 Sno  Text                   Category       column1    column2 ......
  1   Deloitte.com           Consulting
  2   Texas.gov              Government
  3   smi@EY.com             Consulting
  4   UTD.edu                Education
  5   rapper@corporate.com   Corporate
 ..... etc.

How can I use a dictionary with multiple values per key in Python to find a full phrase, or part of a phrase, in the Text column and categorize the row accordingly? Can the same logic be used when two matches exist? What happens in that case?

Also, this might sound vague, but the reason I am using a dictionary is that it lets me map multiple values to one category. Is there a better way to do this without a dictionary?


Solution

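  • If I understand correctly, you can invert your dict so that each keyword maps to its category, extract the keywords with findall, then map the first match back to its category: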

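    import re

    # Invert the dict: each keyword maps to its category
    newdict = {i: k for k, v in d.items() for i in v}
    # Escape the keywords so the '.' in '.edu' and '.gov' is matched literally
    pattern = '|'.join(re.escape(key) for key in newdict)
    df.Text.str.findall(pattern).str[0].map(newdict)
    Out[431]: 
    0    Consulting
    1    Government
    2    Consulting
    3     Education
    4     Corporate
    Name: Text, dtype: object

    # Assign the result to a new column
    df['Category'] = df.Text.str.findall(pattern).str[0].map(newdict)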
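
  • On the question of what happens when two matches exist: with the approach above, findall returns every keyword found in the text, and .str[0] keeps only the one that occurs earliest. Below is a minimal, self-contained sketch of one way to keep all matching categories instead. The last two rows of the sample frame, the AllCategories column, and the helper categories_of are hypothetical names and data added for illustration; they are not part of the original question. Matching stays case-sensitive, as in the answer above.

```python
import re

import pandas as pd

# Dictionary and first five rows copied from the question.
d = {
    'Consulting': {'Deloitte', 'EY', 'KPMG', 'PwC'},
    'Education': {'.edu', 'College', 'University'},
    'Government': {'state', '.gov', 'city'},
    'Corporate': {'corpor', 'consumer', 'care'},
}
df = pd.DataFrame({
    'Text': [
        'Deloitte.com',
        'Texas.gov',
        'smi@EY.com',
        'UTD.edu',
        'rapper@corporate.com',
        'KPMG@University.edu',  # hypothetical row: matches two categories
        'nobody@example.org',   # hypothetical row: matches nothing
    ]
})

# Invert the dict so each keyword points at its category.
keyword_to_category = {kw: cat for cat, kws in d.items() for kw in kws}

# Escape regex metacharacters ('.' in '.edu') and try longer keywords first,
# so a longer keyword wins over a shorter one starting at the same position.
keywords = sorted(keyword_to_category, key=len, reverse=True)
pattern = '|'.join(re.escape(kw) for kw in keywords)

# One list of matched keywords per row (an empty list when nothing matches).
matches = df['Text'].str.findall(pattern)

# First match only: the keyword that appears earliest in the text decides.
# Rows without any match become NaN.
df['Category'] = matches.str[0].map(keyword_to_category)


def categories_of(found):
    """Return the distinct categories for a list of keywords, in order."""
    return list(dict.fromkeys(keyword_to_category[kw] for kw in found))


# All matches: every distinct category found in the row.
df['AllCategories'] = matches.apply(categories_of)

print(df)
```

    In this sketch the row 'KPMG@University.edu' gets Consulting in Category, because 'KPMG' occurs first in the text, while AllCategories holds both Consulting and Education. If one category should always win, an assumed priority order could be applied to AllCategories instead of relying on position in the text.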