I get hourly email alerts that tell me how much revenue the company has made in the last hour. I want to extract this information into a pandas dataframe so that i can run some analysis on it.
My problem is that i can't figure out how to extract data from the email body in a usable format. I think i need to use regular expressions but i'm not too familiar with them.
This is what i have so far:
import os
import pandas as pd
import datetime as dt
import win32com.client
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
#Empty Lists
email_subject = []
email_date = []
email_content = []
#find emails
for message in messages:
if message.SenderEmailAddress == 'oracle@xyz.com' and message.Subject.startswith('Demand'):
email_subject.append(message.Subject)
email_date.append(message.senton.date())
email_content.append(message.body)
The email_content list looks like this:
' \r\nDemand: $41,225 (-47%)\t \r\n \r\nOrders: 515 (-53%)\t \r\nUnits: 849 (-59%)\t \r\n \r\nAOV: $80 (12%) \r\nAUR: $49 (30%) \r\n \r\nOrders with Promo Code: 3% \r\nAverage Discount: 21% '
Can anyone tell me how i can split its contents to so that i can get the int value of Demand, Orders and Units in separate columns?
Thanks!
You could use a combination of string.split() and string.strip() to first extract each lines individually.
string = email_content
lines = string.split('\r\n')
lines_stripped = []
for line in lines:
line = line.strip()
if line != '':
lines_stripped.append(line)
This gives you an array like this:
['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV: $80 (12%)', 'AUR: $49 (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']
You can also achieve the same result in a more compact (pythonic) way:
lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']
Once you have this array, you use regexes as you correctly guessed to extract the values. I recommend https://regexr.com/ to experiment with your regex expressions.
After some quick experimenting, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
should work.
Here is the code that produces a dict from your lines_stripped we created above:
import re
regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
matched_dict = {}
for line in lines_stripped:
match = re.match(regex, line)
matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])
print(matched_dict)
This produces the following output:
{'AOV': ('$80', '12%)'),
'AUR': ('$49', '30%)'),
'Average Discount': ('21%', ''),
'Demand': ('$41,225', '-47%)'),
'Orders': ('515', '-53%)'),
'Orders with Promo Code': ('3%', ''),
'Units': ('849', '-59%)')}
You asked for Units, Orders and Demand, so here is the extraction:
# Remove the dollar sign before converting to float
# Replace , with empty string
demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
print(int(demand_string))
print(int(matched_dict['Orders'][0]))
print(int(matched_dict['Units'][0]))
As you can see, Demand is a little bit more complicated because it contains some extra characters python can't decode when converting to int.
Here is the final output of those 3 prints:
41225
515
849
Hope I answered your question ! If you have more questions about regex, I encourage you to experiement with regexr, it's very well built !
EDIT: Looks like there is a small issue in the regex causing the final ')' to be included in the last group. This does not affect your question though !