
How to clean an HTML string to parse it in Python using lxml?


I have a Python string containing HTML, taken from a JSON payload, that I want to parse with the lxml library. The string contains escaped quotes (\"), escaped line breaks (\r\n), and other special characters. How can I clean it so that I can extract information from it with lxml? I want to use XPath selectors on the string.

The string:

<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<html>\r\n\r\n<head>\r\n    <META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\r\n</head>\r\n\r\n<body>\r\n\r\n<div>\r\n    <table width=\"640\" align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:14px\">\r\n        <tr>\r\n            <td align=\"center\">\r\n\r\n                <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"max-width:600px;text-align:left\">\r\n                    <tr>\r\n                        <td width=\"600\">\r\n                            <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\">\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td align=\"center\">\r\n                                        <a href=\"#0.1_\"><img src=\"https://ns.yatracdn.com/common/images/emailers/corp-flight-hotel/yatra-logo.png\" width=\"101\" height=\"45\" alt=\"Yatra.com\" title=\"Yatra.com\" border=\"0\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:25px;color:#ea2330\" vspace=\"0\" hspace=\"0\" align=\"center\"></a>\r\n                                    </td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td>\r\n                                        <table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\" style=\"border:1px solid #d8d8d8\">\r\n                                            <tr>\r\n                                               
 <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td colspan=\"3\"><b>Travel Request Details</b></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td>\r\n                                                    <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"border:1px solid #d8d8d8\">\r\n                                                        <tbody>\r\n                                                        <tr>\r\n                                                            <td width=\"10\"></td>\r\n                                                            <td>\r\n                                                                <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\">\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Email Verification Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr id='aaaaa'>\r\n                         
                                           <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Request Submission Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Product </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Flight</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Journey Type </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">One way</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n  
                                                                          Adult </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">1</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Child </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Infant </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Flight Class </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Travel Class</td>\r\n                     
                                               </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Preferred Airline </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            </td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Non Stop Flight </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Airline</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Email </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">ankityadav56@demo.com</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n        
                                                                <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Mobile</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">9971255462</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Travel Policy Email</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Corporate.traveler@yatra.com</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Origin</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">New Delhi(DEL)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Destination</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" 
style=\"border-bottom:1px solid #d8d8d8\">Mumbai(BOM)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Depart Date</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">26 Jun 2020</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Time From</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">00:23</td>\r\n                                                                    </tr>\r\n\r\n                                                                </table>\r\n                                                            </td>\r\n                                                            <td width=\"10\"></td>\r\n                                                        </tr>\r\n\r\n                                                        </tbody>\r\n                                                    </table>\r\n\r\n                                                </td>\r\n                                                <td width=\"10\"></td>\r\n                                            </tr>\r\n\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                             
           </table>\r\n\r\n                                    </td>\r\n                                </tr>\r\n                            </table>\r\n                        </td>\r\n                    </tr>\r\n                </table>\r\n            </td>\r\n        </tr>\r\n    </table>\r\n\r\n</div>\r\n\r\n</body>\r\n\r\n</html>

With a clean string, the parser works like this:

>>> from io import StringIO
>>> from lxml import etree
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
>>> tree   = etree.parse(StringIO(broken_html), parser)

>>> result = etree.tostring(tree.getroot(),
...                         pretty_print=True, method="html")
>>> print(result.decode())  # tostring() returns bytes on Python 3
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>
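
Once the tree is built, XPath selectors can be run directly on it. The following is a minimal sketch that reuses the same `broken_html` string; the two XPath expressions are only illustrative choices, not something from the original example.

```python
from io import StringIO
from lxml import etree

broken_html = "<html><head><title>test<body><h1>page title</h3>"

# The HTML parser repairs the unclosed and mismatched tags while parsing
tree = etree.parse(StringIO(broken_html), etree.HTMLParser())

# xpath() returns a list; text() nodes come back as plain strings
titles = tree.xpath("//title/text()")
headings = tree.xpath("//h1/text()")

print(titles, headings)
```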

Solution

  • Maybe you want to use BeautifulSoup? It's a library that parses the markup into a tree you can iterate over. You can also search for specific tags, classes and so on. P.S. One of its parser options is lxml.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(broken_html, 'lxml')
    soup.title  # returns the <title> tag
    soup.find_all('div')  # returns a list of all div tags
    my_tag = soup.find(id="yourID")  # returns None if no tag has that id
    my_tag.find_all('div')  # returns every div tag inside the tag with the id yourID
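
If you would rather stay with lxml and XPath, the `\"` and `\r\n` sequences in the question look like ordinary JSON string escapes, so decoding the JSON may be all the cleaning that is needed. Below is a rough sketch under that assumption: `raw_json` is a shortened, hypothetical stand-in for the real payload, the key name `body` is made up, and treating every two-cell row as a label/value pair is just one possible way to read the table.

```python
import json
from lxml import html

# Hypothetical, shortened stand-in for the JSON payload in the question.
# The key name "body" is an assumption, not taken from the original post.
raw_json = r'''{"body": "<html>\r\n<body>\r\n<table>\r\n<tr>\r\n<td align=\"left\">\r\n Product </td>\r\n<td align=\"right\">Flight</td>\r\n</tr>\r\n<tr>\r\n<td align=\"left\">Origin</td>\r\n<td align=\"right\">New Delhi(DEL)</td>\r\n</tr>\r\n</table>\r\n</body>\r\n</html>"}'''

# json.loads turns \" into " and \r\n into real line breaks
html_text = json.loads(raw_json)["body"]

# lxml.html.fromstring parses the decoded markup into an element tree
doc = html.fromstring(html_text)

details = {}
# Rows with exactly two cells are treated as label/value pairs
for row in doc.xpath("//tr[count(td) = 2]"):
    # text_content() gathers all text in the cell; split/join trims whitespace
    label, value = (" ".join(td.text_content().split()) for td in row.xpath("./td"))
    details[label] = value

print(details)
```

If the string already came out of `json.loads` (or `response.json()`), the backslashes are probably only part of its printed representation and it can be passed to the parser as it is. On the full email in the question, layout rows with two cells would also match `//tr[count(td) = 2]`, so a narrower expression may be needed there.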