Search code examples
google-sheetsanalysis

exclude certain text from keyword analysis in google sheets


I'm trying to do a little bit of analysis on the topics of emails I receive. I have the emails in a Google-sheet in the format below. I'm trying to count how often 'privacy' or 'confidentiality' are mentioned. My challenge is that pretty much every email signature mentions one of those words, so when i use SEARCH every cell returns TRUE.

Most email signatures start with similar phrases, so I tried deleting anything after those phrases with this formula:

=ArrayFormula(TRIM(LEFT(B1:B,MIN(IFERROR(FIND({" This email and any","IMPORTANT NOTICE", " Important notice","The information in this email"," The contents of this message"," Information in this email including"," This electronic mail message"," this message and any attachments"," This message is intended for the addressee only"," This email is CONFIDENTIAL"},B1:B),LEN(L2))))))

Column B is the column with the email body text in.

However that seems to be deleting text that follows words that aren't in my search (deleting everything after 'not' instead of 'IMPORTANT NOTICE' for instance).

Could anyone advise on either:

  1. what's wrong with my above search
  2. an alternate way of searching for 'privacy' and 'confidentiality' without including text from email signatures.

Example table: |email title|email body| |-----------|----------| |Do you want to buy my stuff| Hi there, I'd like to know if you'd like to buy this thing I want to sell you. IMPORTANT: this email is private| |two-for-the-price-of-one| I've a great offer for you! This email and attachments are private & confidential| |Last chance to buy stuff!| Can we have a private call about whether you want to buy my stuff yet?|

In the example above I want to count row 3, but not rows 1 & 2, as the 'private' and 'confidential' mentions in 1 & 2 are in the signature.

Thanks!


Solution

  • I think I understand the error that you've described is occuring with your formula. Once the formula finds one of the values you are using to try to identify an email signature, such as " Important notice", and returns the location of that text, let's say position 96, it then uses 96 for all of the cells, like this: LEFT(B1:B,96). So you might not be able to do the compound arrayformula of an arrayformula that you are trying.

    Using the formula like this, in B2, and dragging it down, should work though:

    =ArrayFormula(TRIM(LEFT(B2,MIN(IFERROR(
      FIND({" This email and any","IMPORTANT NOTICE", " Important notice","The information in this email"," The contents of this message"," Information in this email including"," This electronic mail message"," this message and any attachments"," This message is intended for the addressee only"," This email is CONFIDENTIAL"},B2),
      LEN(L2))))))
    

    Note: I'm not sure what value is in your L2.

    But for the overall approach, it really depends on how well your terms to identify email signatures work, so as to exclude them from your final full text searches.