Search code examples
pythonhtmlbeautifulsoupoutlook

Extracting Text from HTML table in Outlook


I have a need to extract the values from an HTML table that gets emailed out weekly. My loop is checking each email in the specified folder for the table as there is some existing emails with information that I would like to capture. Ultimately this data will get pushed to a pandas dataframe for additional manipulation.

The table has 4 header columns but 5 data columns which is causing all kinds of errors in pandas. This code below has gotten me closer to the end result but still having issues. The data is being populated as 1x24 in the XLSX file. I need the data in 5x5, but not sure as how I can accomplish this with the header row only being 4 columns.

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup

# email specs
outlook = win32com.client.Dispatch('outlook.application')
mapi = outlook.GetNamespace("MAPI")
inbox = mapi.GetDefaultFolder(6).Folders["TestFolder"]
messages = inbox.Items

# read email
for message in messages:
    body_content = message.HTMLBody
    bs = BeautifulSoup(body_content, features="lxml")
    table = bs.find_all("table")
    for row in table:
        r = row.find_all('span')
        df = pd.DataFrame(r)

    # save dataframe
    df.to_excel('df.xlsx')

Sample HTML table file below-

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
    {font-family:"Book Antiqua";
    panose-1:2 4 6 2 5 3 5 3 3 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:12.0pt;
    font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
    {mso-style-priority:99;
    color:#0563C1;
    text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
    {mso-style-priority:99;
    color:#954F72;
    text-decoration:underline;}
p.msonormal0, li.msonormal0, div.msonormal0
    {mso-style-name:msonormal;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    font-size:12.0pt;
    font-family:"Times New Roman",serif;}
p.xl1519243, li.xl1519243, div.xl1519243
    {mso-style-name:xl1519243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7119243, li.xl7119243, div.xl7119243
    {mso-style-name:xl7119243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7219243, li.xl7219243, div.xl7219243
    {mso-style-name:xl7219243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7319243, li.xl7319243, div.xl7319243
    {mso-style-name:xl7319243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7419243, li.xl7419243, div.xl7419243
    {mso-style-name:xl7419243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7519243, li.xl7519243, div.xl7519243
    {mso-style-name:xl7519243;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl1519240, li.xl1519240, div.xl1519240
    {mso-style-name:xl1519240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7119240, li.xl7119240, div.xl7119240
    {mso-style-name:xl7119240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7219240, li.xl7219240, div.xl7219240
    {mso-style-name:xl7219240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7319240, li.xl7319240, div.xl7319240
    {mso-style-name:xl7319240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    border:none;
    padding:0in;
    font-size:10.0pt;
    font-family:"Calibri",sans-serif;
    color:black;}
p.xl7419240, li.xl7419240, div.xl7419240
    {mso-style-name:xl7419240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
p.xl7519240, li.xl7519240, div.xl7519240
    {mso-style-name:xl7519240;
    mso-margin-top-alt:auto;
    margin-right:0in;
    mso-margin-bottom-alt:auto;
    margin-left:0in;
    text-align:center;
    background:#C4BD97;
    border:none;
    padding:0in;
    font-size:8.0pt;
    font-family:"Book Antiqua",serif;
    color:black;
    font-weight:bold;}
span.EmailStyle30
    {mso-style-type:personal-compose;
    font-family:"Calibri",sans-serif;
    font-weight:bold;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-size:10.0pt;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="font-family:&quot;Calibri&quot;,sans-serif"><o:p>&nbsp;</o:p></span></b></p>
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-family:&quot;Calibri&quot;,sans-serif">Lorem<o:p></o:p></span></b></p>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-family:&quot;Calibri&quot;,sans-serif"><br>
<br>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus ullamcorper nec leo sed blandit. Etiam fringilla aliquam erat, et congue mi malesuada eget. Vestibulum elit magna, fermentum eu orci sit amet, eleifend finibus felis. Sed massa orci, dignissim vel ex sit amet, fringilla imperdiet lorem. Phasellus euismod cursus lorem. Morbi tristique sapien a feugiat rutrum. Nulla egestas fermentum lorem, eu bibendum libero tristique in. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Nulla magna ante, ultrices eu ipsum vitae, vulputate consectetur neque. Nam vel nisi euismod, pulvinar mauris non, porta felis. Nulla pellentesque libero ut semper iaculis..<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,sans-serif">&nbsp;&nbsp; <o:p></o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="0" style="width:700.0pt;border-collapse:collapse">
<tbody>
<tr style="height:15.75pt">
<td width="293" nowrap="" colspan="2" style="width:220.0pt;border:solid windowtext 1.0pt;border-right:solid black 1.0pt;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">Asset ID &amp; Name<o:p></o:p></span></b></p>
</td>
<td width="231" nowrap="" style="width:173.0pt;border:solid windowtext 1.0pt;border-left:none;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">Lead Participant<o:p></o:p></span></b></p>
</td>
<td width="117" nowrap="" style="width:88.0pt;border:solid windowtext 1.0pt;border-left:none;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">ISO Determination<o:p></o:p></span></b></p>
</td>
<td width="292" nowrap="" style="width:219.0pt;border:solid windowtext 1.0pt;border-left:none;background:#C4BD97;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="font-size:8.0pt;font-family:&quot;Book Antiqua&quot;,serif;color:black">ISO Comment<o:p></o:p></span></b></p>
</td>
</tr>
<tr style="height:15.75pt">
<td nowrap="" style="border:solid windowtext 1.0pt;border-top:none;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">65133<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:15.75pt">
<td nowrap="" style="border:solid windowtext 1.0pt;border-top:none;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">69067<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:15.75pt">
<td nowrap="" style="border:solid windowtext 1.0pt;border-top:none;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">69070<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
<td nowrap="" style="border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:.75pt .75pt 0in .75pt;height:15.75pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:black">XXXXXX<o:p></o:p></span></p>
</td>
</tr>
<tr>
<td width="67" style="width:50.0pt;padding:0in 0in 0in 0in"></td>
<td width="227" style="width:170.0pt;padding:0in 0in 0in 0in"></td>
<td width="231" style="width:173.0pt;padding:0in 0in 0in 0in"></td>
<td width="117" style="width:88.0pt;padding:0in 0in 0in 0in"></td>
<td width="292" style="width:219.0pt;padding:0in 0in 0in 0in"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-family:&quot;Calibri&quot;,sans-serif"><br>
<br>


Desired output would be:

data format


Solution

  • Assuming body_content is the sample data and that every email message only has a single table (or that the table being targeted is the first one in any given message):

    # get table
    bs = BeautifulSoup(body_content, features="html.parser")
    tables = bs.find_all("table")
    table = tables[0]  # assumes it is always the first table in any message
    
    # get table data
    records = []
    rows = table.findAll("tr")
    for row in rows:
        cells = row.findAll("td")
        row_data = []
        for cell in cells:
            val = cell.find("p", attrs={"class": "MsoNormal"})
            try:
                row_data.append(val.text.strip())
            except AttributeError:
                row_data.append(None)
        if not all(v is None for v in row_data):  # don't add empty row like end of sample data
            records.append(row_data)
    
    # since there are only four column headers, we will extract these out and add an extra one
    headers = records.pop(0)
    headers.append("Unknown Extra Column")
    
    # create pandas dataframe
    df = pd.DataFrame(records, columns=headers)
    

    Dataframe should look like this for sample data:

      Asset ID & Name Lead Participant ISO Determination ISO Comment  \
    0           65133           XXXXXX            XXXXXX      XXXXXX   
    1           69067           XXXXXX            XXXXXX      XXXXXX   
    2           69070           XXXXXX            XXXXXX      XXXXXX   
    
      Unknown Extra Column  
    0               XXXXXX  
    1               XXXXXX  
    2               XXXXXX
    

    You'll need to change the nesting of some things for your loop and such, but this works with the sample data provided.