Search code examples
pythonregexpython-3.xpandasoutlook-2010

Extracting numbers from outlook email body with Python


I get hourly email alerts that tell me how much revenue the company has made in the last hour. I want to extract this information into a pandas dataframe so that i can run some analysis on it.

My problem is that i can't figure out how to extract data from the email body in a usable format. I think i need to use regular expressions but i'm not too familiar with them.

This is what i have so far:

import os
import pandas as pd
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items

#Empty Lists
email_subject = []
email_date = []
email_content = []

#find emails

for message in messages:
    if message.SenderEmailAddress == 'oracle@xyz.com' and message.Subject.startswith('Demand'):
        email_subject.append(message.Subject)
        email_date.append(message.senton.date()) 
        email_content.append(message.body)

The email_content list looks like this:

'                                                                                                                   \r\nDemand: $41,225 (-47%)\t                                                                            \r\n                                                                                                                       \r\nOrders: 515 (-53%)\t                                                                                \r\nUnits: 849 (-59%)\t                                                                                 \r\n                                                                                                                       \r\nAOV: $80 (12%)                                                                                                          \r\nAUR: $49 (30%)                                                                                                          \r\n                                                                                                                       \r\nOrders with Promo Code: 3%                                                                                              \r\nAverage Discount: 21%                                                                                             '

Can anyone tell me how i can split its contents to so that i can get the int value of Demand, Orders and Units in separate columns?

Thanks!


Solution

  • You could use a combination of string.split() and string.strip() to first extract each lines individually.

    string = email_content
    lines = string.split('\r\n')
    lines_stripped = []
    for line in lines:
        line = line.strip()
        if line != '':
            lines_stripped.append(line)
    

    This gives you an array like this:

    ['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV: $80 (12%)', 'AUR: $49 (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']
    

    You can also achieve the same result in a more compact (pythonic) way:

    lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']
    

    Once you have this array, you use regexes as you correctly guessed to extract the values. I recommend https://regexr.com/ to experiment with your regex expressions.

    After some quick experimenting, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?' should work.

    Here is the code that produces a dict from your lines_stripped we created above:

    import re
    regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
    matched_dict = {}
    for line in lines_stripped:
        match = re.match(regex, line)
        matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])
    
    print(matched_dict)
    

    This produces the following output:

    {'AOV': ('$80', '12%)'),
     'AUR': ('$49', '30%)'),
     'Average Discount': ('21%', ''),
     'Demand': ('$41,225', '-47%)'),
     'Orders': ('515', '-53%)'),
     'Orders with Promo Code': ('3%', ''),
     'Units': ('849', '-59%)')}
    

    You asked for Units, Orders and Demand, so here is the extraction:

    # Remove the dollar sign before converting to float
    # Replace , with empty string
    demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
    print(int(demand_string))
    print(int(matched_dict['Orders'][0]))
    print(int(matched_dict['Units'][0]))
    

    As you can see, Demand is a little bit more complicated because it contains some extra characters python can't decode when converting to int.

    Here is the final output of those 3 prints:

    41225
    515
    849
    

    Hope I answered your question ! If you have more questions about regex, I encourage you to experiement with regexr, it's very well built !

    EDIT: Looks like there is a small issue in the regex causing the final ')' to be included in the last group. This does not affect your question though !