python pandas dataframe nlp

Extract two specified words from the dataframe and place them in a new column, then delete the rows


This is the dataframe:

import pandas as pd

data = {"Company" : [["ConsenSys"] , ["Cognizant"], ["IBM"], ["IBM"], ["Reddit, Inc"], ["Reddit, Inc"], ["IBM"]],
"skills" : [['services', 'scientist technical expertise', 'databases'], ['datacomputing tools experience', 'deep learning models', 'cloud services'], ['quantitative analytical projects', 'financial services', 'field experience'],
['filesystems server architectures', 'systems', 'statistical analysis', 'data analytics', 'workflows', 'aws cloud services'], ['aws services'], ['data mining statistics', 'statistical analysis', 'aws cloud', 'services', 'data discovery', 'visualization'], ['communication skills experience', 'services', 'manufacturing environment', 'sox compliance']]}

df = pd.DataFrame(data)
df
  • I need to create a new column that holds specific words extracted from the skills column.
  • Rows that contain none of those words should then be deleted.
  • Specific words: 'services', 'statistical analysis'

Expected Output:

Company skills new_col
0 [ConsenSys] [services, scientist technical expertise, databases] [services]
1 [IBM] [filesystems server architectures, systems, statistical analysis, data analytics, workflows, aws cloud services] [statistical analysis]
2 [Reddit, Inc] [data mining statistics, statistical analysis, aws cloud, services, data discovery, visualization] [services, statistical analysis]
3 [IBM] [communication skills experience, services, manufacturing environment, sox compliance] [services]

I tried several pieces of code from Stack Overflow for extracting specific words, but none of them worked.


Solution

    # Words to look for; `in` on a list matches whole elements only
    word = ['services', 'statistical analysis']
    s1 = df['skills'].apply(lambda x: [i for i in word if i in x])
    

    Output (s1):

    0                          [services]
    1                                  []
    2                                  []
    3              [statistical analysis]
    4                                  []
    5    [services, statistical analysis]
    6                          [services]
    Name: skills, dtype: object
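
Note that `i in x` tests membership in the list, so a word counts only when it equals a whole list element: 'aws cloud services' does not match 'services'. The sketch below contrasts this with a substring variant, in case looser matching is wanted. The two toy rows and the helper names `exact_matches` and `substring_matches` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

words = ['services', 'statistical analysis']

# Two toy rows (illustrative, not the question's full data).
skills = pd.Series([
    ['aws cloud services', 'statistical analysis'],
    ['services', 'databases'],
])

def exact_matches(items, words):
    """Keep each word that equals a whole element of `items`."""
    return [w for w in words if w in items]

def substring_matches(items, words):
    """Keep each word that occurs inside any element of `items`."""
    return [w for w in words if any(w in item for item in items)]

exact = skills.apply(lambda x: exact_matches(x, words))
loose = skills.apply(lambda x: substring_matches(x, words))

print(exact.tolist())
print(loose.tolist())
```

With exact matching the first row yields only 'statistical analysis'; with substring matching it also picks up 'services' from 'aws cloud services'.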
    

    Assign s1 as new_col, then use boolean indexing to keep only the rows where new_col is not empty:

    df.assign(new_col=s1)[lambda x: x['new_col'].astype('bool')]
    

    Result:

       Company        skills                                             new_col
    0  [ConsenSys]    [services, scientist technical expertise, data...  [services]
    3  [IBM]          [filesystems server architectures, systems, st...  [statistical analysis]
    5  [Reddit, Inc]  [data mining statistics, statistical analysis,...  [services, statistical analysis]
    6  [IBM]          [communication skills experience, services, ma...  [services]
    

    I think you should provide a simpler example.
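
A reduced version of the question's data might make the steps easier to follow. The sketch below uses three shortened, made-up rows (`df_small` is a hypothetical name) and runs the same two steps end to end. The final `reset_index(drop=True)` is an assumption based on the expected output being numbered from 0; drop it if the original row labels should be kept.

```python
import pandas as pd

# Three shortened rows; the values are illustrative, not the full data.
df_small = pd.DataFrame({
    "Company": [["ConsenSys"], ["Cognizant"], ["IBM"]],
    "skills": [
        ["services", "databases"],
        ["deep learning models", "cloud services"],
        ["systems", "statistical analysis"],
    ],
})

words = ["services", "statistical analysis"]

# Collect the target words that appear as whole elements of each list.
new_col = df_small["skills"].apply(lambda x: [w for w in words if w in x])

# Keep rows with a non-empty match list, then renumber from 0.
result = (
    df_small.assign(new_col=new_col)
    .loc[lambda d: d["new_col"].astype(bool)]
    .reset_index(drop=True)
)
print(result)
```

The Cognizant row is dropped because 'cloud services' is not an exact match for 'services', which leaves the ConsenSys and IBM rows.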