Search code examples
pythonpandasdataframelistfindall

Check if Python list elements are in a Pandas dataframe row and append each unique occurrence of the list values to a new column


My question is similar to this question raised However, The solution provided appears to work where only one element of the list will exist in the dataframe column.

My issue is that multiple values from the list might exist in each row of the df and the list values might be referenced more than once.

I have a dataframe which includes a number of job listings. The Job description column contains the full job description. To extract the key skill requirements from the job description column a number of additional columns were created. eg. Programming_Languages,Data visualisation etc

Each column has a list of values to check for there occurrence in the job description eg.

pl_list=['Python','RStudio','C#','HTML5']
dv_list=['Power BI,Tableau']

My current solution:

df2['Programming_Languages'] = df2['Programming_Languages'].str.findall('|'.join(pl_list),re.IGNORECASE)
    df2['Data_Visualisation'] = df2['Data_Visualisation'].str.findall('|'.join(dv_list),re.IGNORECASE)

The job description might reference the values in a list multiple times. so if python and Rstudio are referenced more than once it will return each occurrence. I would just like it to return the value once.

Code to recreate df

df=[['ABC','ABC','Python, C#, Power BI Tableau python Power BI'],['ABC','ABC','Power BI Python'],['ABC','ABC','Rstudio Rstudio Rstudio Tableau'],['ABC','ABC','C# C# C# HTML5']]
df = pd.DataFrame(df, columns=['Companyname', 'Jobtitle','Job_description'])

df2 = df.reindex(df.columns.tolist() + ['Programming_Languages','Data_Visualisation'],axis=True)

My current result:

My current result

My desired result:

My desired result


Solution

  • Here's a way to do it if you don't mind the case of the elements in the new columns:

    skills = {
        'Programming_Languages': ['Python','C#','Rstudio','HTML5'],
        'Data_visualization': ['Power BI','Tableau']
    }
    df[list(skills)] = pd.DataFrame([
        df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
            .apply(lambda x: pd.Series(x).str.lower().unique() if len(x) else [])
        for skill in skills], index=list(skills)).T
    

    Output:

      Companyname Jobtitle                                    Job_description Programming_Languages   Data_visualization
    0         ABC      ABC  Python,C#,Power BI,Tableau,python,Power BI,Pow...          [python, c#]  [power bi, tableau]
    1         ABC      ABC                                    Power BI,Python              [python]           [power bi]
    2         ABC      ABC            Rstudio,Rstudio,Rstudio,Tableau,Tableau             [rstudio]            [tableau]
    3         ABC      ABC                                     C#,C#,C#,HTML5           [c#, html5]                   []
    

    Explanation:

    • Use the same code as in OP to get the list of matches with duplicates
    • Use Series.apply() with a lambda that, for each Series value (namely, for each list in the Series), creates a Series from the list and uses .str.lower().unique() to de-dupe the list after first making all its items lowercase
    • Check in advance to see if the list is empty, in which case skip the above logic (to avoid an error on .str) and just supply an empty list
    • Use the list comprehension ending in for skill in skills to do the above for each item in the skills dict (which contains an entry for Programming_Languages and another for Data_visualization, and could be extended to include additional skills if ever needed)
    • Wrap the results in DataFrame() with index argument set to the keys in skills, transpose it, and assign to new columns in the input dataframe named after the keys in skills.

    If you want the results to contain the search terms with their original case, you can do this:

    skills = {
        'Programming_Languages': ['Python','C#','Rstudio','HTML5'],
        'Data_visualization': ['Power BI','Tableau']
    }
    skills_de_lower = {key: {x.lower() : x for x in skill_list} for key, skill_list in skills.items()}
    df[list(skills)] = pd.DataFrame([
        df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
            .apply(lambda x: [y[0] for y in 
                pd.DataFrame(pd.Series(x).str.lower().unique())
                    .applymap(lambda x: skills_de_lower[skill][x]).to_numpy()] 
            if len(x) else [])
        for skill in skills], index=list(skills)).T
    

    Output:

      Companyname Jobtitle                                    Job_description Programming_Languages   Data_visualization
    0         ABC      ABC  Python,C#,Power BI,Tableau,python,Power BI,Pow...          [Python, C#]  [Power BI, Tableau]
    1         ABC      ABC                                    Power BI,Python              [Python]           [Power BI]
    2         ABC      ABC            Rstudio,Rstudio,Rstudio,Tableau,Tableau             [Rstudio]            [Tableau]
    3         ABC      ABC                                     C#,C#,C#,HTML5           [C#, HTML5]                   []
    

    Explanation:

    • Similar to the lowercase answer above, but introduce a dict skills_de_lower to map from the lowercase skill values used to de-dupe, back to the original skill values we searched for
    • Use a DataFrame wrapper, applymap() and y[0] for y in ... within the list comprehension to do the mapping work and some list-of-list unpacking to get results in the same form as before (a list of strings).

    If you want csv values instead of arrays, you can add .str.join(',') just before for skill in skills like this:

    df[list(skills)] = pd.DataFrame([
        df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
            .apply(lambda x: [y[0] for y in 
                pd.DataFrame(pd.Series(x).str.lower().unique())
                    .applymap(lambda x: skills_de_lower[skill][x]).to_numpy()] 
            if len(x) else []).str.join(',')
        for skill in skills], index=list(skills)).T
    

    Output:

      Companyname Jobtitle                                    Job_description Programming_Languages Data_visualization
    0         ABC      ABC  Python,C#,Power BI,Tableau,python,Power BI,Pow...             Python,C#   Power BI,Tableau
    1         ABC      ABC                                    Power BI,Python                Python           Power BI
    2         ABC      ABC            Rstudio,Rstudio,Rstudio,Tableau,Tableau               Rstudio            Tableau
    3         ABC      ABC                                     C#,C#,C#,HTML5              C#,HTML5
    

    Explanation:

    • When we use .str.join(','), the .str part is actually acting on the list in each value (here str does not mean what its name implies, but something closer to "sequence", which in our case means the list)
    • See the documentation for .str.join() for an example of how this works only on lists of strings.