Let's assume a DataFrame (df) with a string column named 'message' that contains transactional messages.
Let's also assume the values in 'message' look like this:
Now suppose I want to search whether the 'message' contains credit card transactions.
So I would search for the keywords 'credit' and 'card', and if both keywords are present in the message, it is classified as a credit card transaction.
Code:
df[ (df['message'].str.contains('credit')) & (df['message'].str.contains('card')) ]
But this line of code returns both of the above messages, (1) and (2), because each of them contains both keywords 'credit' and 'card'.
In fact, the 1st message is clearly not a credit card transaction. It just happens to contain both keywords.
So can somebody help me with a line of code that returns only the 2nd transaction, by checking for the keywords 'credit card' together rather than separately?
Your sticking point has nothing to do with pandas; it's entirely a string-processing issue. Reduce the problem with:
s = df["message"].str
Now, you need to find "credit" followed by "card". If there is always exactly one space between the words, then simply `s.contains("credit card")` will solve your problem. If you have other spacing or punctuation, then you need to do more work on the string.
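As a minimal sketch of the single-space case: the sample messages below are made up for illustration (the actual messages are not shown in the question), and `regex=False` is used so that the phrase is treated as a literal substring.

```python
import pandas as pd

# Hypothetical sample data; the real messages were not given in the question.
df = pd.DataFrame({
    "message": [
        "Your card ending 1234 received a credit of 500",
        "Payment of 500 made using your credit card",
    ]
})

# Literal substring match: "credit", exactly one space, then "card".
mask = df["message"].str.contains("credit card", regex=False)
matches = df[mask]
print(matches)
```

Only the second row is kept here, because the first contains both words but not next to each other.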
For whitespace only, you can split the string and look for the adjacent words:
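for row, words in s.split().items(): # s.split() gives one list of words per row
    for idx, word in enumerate(words[:-1]): # look for credit in all but the final word
        if word == "credit" and words[idx+1] == "card":
            # You found "credit card" ... process the row (here, just print its index label)
            print(row)
            break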
If you have other punctuation, then build the list `words` by splitting on that punctuation as well and removing those characters; exactly how depends on the characters in your input, which you haven't specified.
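One possible alternative to building the word list by hand is a regular expression. The sketch below is only an assumption about what your data might look like: the sample messages are invented, and the pattern `\bcredit\W+card\b` accepts any run of non-word characters (spaces, hyphens, commas) between the two words. Note that `\W` does not match an underscore, so "credit_card" would not be found with this pattern.

```python
import pandas as pd

# Hypothetical sample data with varied spacing, case and punctuation.
df = pd.DataFrame({
    "message": [
        "Card ending 1234: credit of 500 received",
        "Paid 500 using Credit  Card",
        "Paid 500 using credit-card at store",
    ]
})

# "credit", then one or more non-word characters, then "card", as whole words.
pattern = r"\bcredit\W+card\b"

# case=False ignores letter case; na=False treats missing messages as non-matches.
mask = df["message"].str.contains(pattern, case=False, regex=True, na=False)
matches = df[mask]
print(matches)
```

If your separators are more restricted (say, whitespace only), `\s+` in place of `\W+` would be the tighter choice.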
Will that get you going?