python regex python-3.x pandas outlook-2010

Extracting numbers from outlook email body with Python

I get hourly email alerts that tell me how much revenue the company has made in the last hour. I want to extract this information into a pandas dataframe so that i can run some analysis on it.

My problem is that i can't figure out how to extract data from the email body in a usable format. I think i need to use regular expressions but i'm not too familiar with them.

This is what i have so far:

import os
import pandas as pd
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items

#Empty Lists
email_subject = []
email_date = []
email_content = []

#find emails

for message in messages:
    if message.SenderEmailAddress == 'oracle@xyz.com' and message.Subject.startswith('Demand'):
        email_subject.append(message.Subject)
        email_date.append(message.senton.date()) 
        email_content.append(message.body)

The email_content list looks like this:

'                                                                                                                   \r\nDemand: $41,225 (-47%)\t                                                                            \r\n                                                                                                                       \r\nOrders: 515 (-53%)\t                                                                                \r\nUnits: 849 (-59%)\t                                                                                 \r\n                                                                                                                       \r\nAOV: $80 (12%)                                                                                                          \r\nAUR: $49 (30%)                                                                                                          \r\n                                                                                                                       \r\nOrders with Promo Code: 3%                                                                                              \r\nAverage Discount: 21%                                                                                             '

Can anyone tell me how i can split its contents to so that i can get the int value of Demand, Orders and Units in separate columns?

Thanks!

Solution

You could use a combination of string.split() and string.strip() to first extract each lines individually.

string = email_content
lines = string.split('\r\n')
lines_stripped = []
for line in lines:
    line = line.strip()
    if line != '':
        lines_stripped.append(line)

This gives you an array like this:

['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV: $80 (12%)', 'AUR: $49 (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']

You can also achieve the same result in a more compact (pythonic) way:

lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']

Once you have this array, you use regexes as you correctly guessed to extract the values. I recommend https://regexr.com/ to experiment with your regex expressions.

After some quick experimenting, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?' should work.

Here is the code that produces a dict from your lines_stripped we created above:

import re
regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
matched_dict = {}
for line in lines_stripped:
    match = re.match(regex, line)
    matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])

print(matched_dict)

This produces the following output:

{'AOV': ('$80', '12%)'),
 'AUR': ('$49', '30%)'),
 'Average Discount': ('21%', ''),
 'Demand': ('$41,225', '-47%)'),
 'Orders': ('515', '-53%)'),
 'Orders with Promo Code': ('3%', ''),
 'Units': ('849', '-59%)')}

You asked for Units, Orders and Demand, so here is the extraction:

# Remove the dollar sign before converting to float
# Replace , with empty string
demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
print(int(demand_string))
print(int(matched_dict['Orders'][0]))
print(int(matched_dict['Units'][0]))

As you can see, Demand is a little bit more complicated because it contains some extra characters python can't decode when converting to int.

Here is the final output of those 3 prints:

41225
515
849

Hope I answered your question ! If you have more questions about regex, I encourage you to experiement with regexr, it's very well built !

EDIT: Looks like there is a small issue in the regex causing the final ')' to be included in the last group. This does not affect your question though !