
How to make a list of words that are not in another dataframe


I have a problem with pandas dataframes. I have two dataframes with different contents. I want to extract the words from dataframe 1 that do not appear in dataframe 2 and store them in a new dataframe. Can someone help me solve this using pandas? Thank you.

Dataframe 1 contains:
Tweet
Bismillah for tomorrow Amin
shared location
Replying to shahrilPng
It's time to finish what's been pending
up and parallel
When you run after your dream

And dataframe 2 contains:
Words
tomorrow
shared
location
time
finish
pending
parallel
run
after
dream

The output that I want:
Results
Bismillah
for
Amin
Replying
to
shahrilPng
etc


Solution

  • Split each tweet into words, explode them into one row per word, and keep only the words that are not present in your words dataframe (compared case-insensitively). Duplicates are then dropped:

    # returns True for words that are NOT in df2['Words'] (case-insensitive)
    not_in_list = lambda x: ~x.str.casefold().isin(df2['Words'].str.casefold())
    
    out = df1['Tweet'].str.split().explode().loc[not_in_list] \
                      .drop_duplicates().reset_index(drop=True).to_frame('Results')
    print(out)
    
    # Output
           Results
    0    Bismillah
    1          for
    2         Amin
    3     Replying
    4           to
    5   shahrilPng
    6         It's
    7       what's
    8         been
    9           up
    10         and
    11        When
    12         you
    13        your
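
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split / explode / filter approach. The two dataframes are rebuilt from the sample data in the question; the intermediate names `known`, `tokens` and `mask` are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"Tweet": [
    "Bismillah for tomorrow Amin",
    "shared location",
    "Replying to shahrilPng",
    "It's time to finish what's been pending",
    "up and parallel",
    "When you run after your dream",
]})
df2 = pd.DataFrame({"Words": [
    "tomorrow", "shared", "location", "time", "finish",
    "pending", "parallel", "run", "after", "dream",
]})

# Case-folded set of the words to exclude (illustrative helper name)
known = set(df2["Words"].str.casefold())

# Split each tweet on whitespace, then put one word on each row
tokens = df1["Tweet"].str.split().explode()

# True where the word is NOT one of the known words
mask = ~tokens.str.casefold().isin(known)

out = (
    tokens[mask]
    .drop_duplicates()        # keep the first occurrence of each word
    .reset_index(drop=True)
    .to_frame("Results")
)
print(out)
```

Note that splitting on whitespace leaves punctuation attached to the words, so a token such as `pending.` would not match `pending`. If the real tweets contain punctuation, it would probably need to be stripped before the comparison.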