I have a Python string containing HTML, taken from a JSON response, that I want to parse with the lxml library. The string contains escaped quotes (\"), \r\n sequences and other special characters. How can I clean it up so that I can extract information from it with lxml? I want to use XPath selectors on the string.
The string:
<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<html>\r\n\r\n<head>\r\n <META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\r\n</head>\r\n\r\n<body>\r\n\r\n<div>\r\n <table width=\"640\" align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:14px\">\r\n <tr>\r\n <td align=\"center\">\r\n\r\n <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"max-width:600px;text-align:left\">\r\n <tr>\r\n <td width=\"600\">\r\n <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\">\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td align=\"center\">\r\n <a href=\"#0.1_\"><img src=\"https://ns.yatracdn.com/common/images/emailers/corp-flight-hotel/yatra-logo.png\" width=\"101\" height=\"45\" alt=\"Yatra.com\" title=\"Yatra.com\" border=\"0\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:25px;color:#ea2330\" vspace=\"0\" hspace=\"0\" align=\"center\"></a>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td>\r\n <table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\" style=\"border:1px solid #d8d8d8\">\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td width=\"10\"></td>\r\n <td colspan=\"3\"><b>Travel Request Details</b></td>\r\n </tr>\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n <tr>\r\n <td width=\"10\"></td>\r\n <td>\r\n <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"border:1px solid #d8d8d8\">\r\n <tbody>\r\n <tr>\r\n <td width=\"10\"></td>\r\n <td>\r\n <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\">\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Email Verification Date / Time </td>\r\n 
<td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n </tr id='aaaaa'>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Request Submission Date / Time </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Product </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Flight</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Journey Type </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">One way</td>\r\n </tr>\r\n\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Adult </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">1</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Child </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Infant </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Flight Class </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Travel Class</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Preferred Airline </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">\r\n </td>\r\n </tr>\r\n <tr>\r\n <td 
height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Non Stop Flight </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Airline</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Traveller Email </td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">ankityadav56@demo.com</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Traveller Mobile</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">9971255462</td>\r\n </tr>\r\n <tr>\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n Travel Policy Email</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Corporate.traveler@yatra.com</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Origin</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">New Delhi(DEL)</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Destination</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Mumbai(BOM)</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Depart Date</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">26 Jun 2020</td>\r\n </tr>\r\n\r\n <tr >\r\n <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Time From</td>\r\n <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">00:23</td>\r\n </tr>\r\n\r\n </table>\r\n </td>\r\n <td width=\"10\"></td>\r\n </tr>\r\n\r\n 
</tbody>\r\n </table>\r\n\r\n </td>\r\n <td width=\"10\"></td>\r\n </tr>\r\n\r\n <tr>\r\n <td height=\"10\"></td>\r\n </tr>\r\n </table>\r\n\r\n </td>\r\n </tr>\r\n </table>\r\n </td>\r\n </tr>\r\n </table>\r\n </td>\r\n </tr>\r\n </table>\r\n\r\n</div>\r\n\r\n</body>\r\n\r\n</html>
With a clean string, the parser works like this:
>>> from io import StringIO
>>> from lxml import etree
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"
>>> parser = etree.HTMLParser()
>>> tree = etree.parse(StringIO(broken_html), parser)
>>> result = etree.tostring(tree.getroot(),
...                         pretty_print=True, method="html")
>>> print(result.decode())  # tostring() returns bytes on Python 3
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>
Maybe you want to use BeautifulSoup? It's a library that parses the markup into a tree you can iterate over. You can also search for specific tags, classes, ids and so on. P.S. One of the parsers it can use under the hood is lxml.
from bs4 import BeautifulSoup

# 'lxml' selects lxml as the underlying parser (lxml must be installed)
soup = BeautifulSoup(broken_html, 'lxml')
soup.title  # the first <title> tag in the document
soup.find_all('div')  # a list of all <div> tags
my_tag = soup.find(id="yourID")  # the first tag with id="yourID", or None
my_tag.find_all('div')  # every <div> nested inside that tag
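If you would rather stay with lxml and XPath, the following is a minimal sketch, not a definitive solution. It assumes the backslashes in your string are JSON escapes that are still there because the JSON has not been decoded yet; in that case json.loads removes them and no manual cleaning is needed. The "body" key, the shortened sample payload and the value_for helper are made up for illustration, so swap in the real key and data from your JSON.

```python
import json

from lxml import html

# Hypothetical JSON payload: a shortened stand-in for the e-mail in the question.
# The "body" key is an assumption; use whatever key your JSON really has.
raw_json = r'''{"body": "<html>\r\n<body>\r\n<table>\r\n<tr>\r\n<td align=\"left\">\r\n Product </td>\r\n<td align=\"right\">Flight</td>\r\n</tr>\r\n<tr>\r\n<td align=\"left\">Origin</td>\r\n<td align=\"right\">New Delhi(DEL)</td>\r\n</tr>\r\n</table>\r\n</body>\r\n</html>"}'''

# Decoding the JSON turns \" into " and \r\n into real line breaks.
page = json.loads(raw_json)["body"]

# lxml's HTML parser is lenient, so the decoded markup can be parsed directly.
tree = html.fromstring(page)


def value_for(label):
    """Return the text of the cell that follows the cell labelled `label`."""
    cells = tree.xpath(
        "//td[normalize-space(.) = $label]/following-sibling::td[1]",
        label=label,  # passed as an XPath variable, so no string formatting
    )
    return cells[0].text_content().strip() if cells else None


print(value_for("Product"))  # Flight
print(value_for("Origin"))   # New Delhi(DEL)
```

normalize-space() is used because the label cells in your e-mail have line breaks and spaces around the text. If the string you have is already the result of json.loads (for example via response.json() in requests), the escapes you see are probably only how the repr is displayed, and you can pass the string to html.fromstring as it is.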