python html regex web-scraping beautifulsoup

Need to convert an HTML string into text using Python


This is what I have. I want code to which I can pass this whole string and get back only the text part. This is not a web page; it is simply a string, like an HTML page saved with a .txt extension. All the other solutions I found use Beautiful Soup with a URL, but this is not a web page. Any help will be appreciated.

b'<!DOCTYPE HTML>\r\n
<html>
   \r\n
   <head>
      \r\n
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      \r\n
      <title>TalentHire - Simplified Recruiting and Staffing</title>
      \r\n
   </head>
   \r\n        \r\n
   <body leftmargin="0" rightmargin="0" topmargin="0" bottommargin="0">
      \r\n        
      <div style="width:100%; overflow:auto; float:left; margin: auto;">
         \r\n        
         <table cellpadding="0" cellspacing="0" border="0" style="width:100%; min-width:300px;">
            \r\n                        
            <tr>
               \r\n                
               <td style=" border:none;">
                  \r\n                \t
                  <table cellpadding="0" cellspacing="0" style="width:100%; min-width:280px; margin:0 auto; border:none;">
                     \r\n                        
                     <tr>
                        \r\n                            
                        <td style="font-family: calibri,sans-serif !important; font-size:15px !important; color:#333 !important; line-height:22px; border:none;">
                           \r\n                                
                           <div id="EditorSalutationID">
                              \r\n
                              <p>Position:&nbsp; Azure Architect</p>
                              \r\n\r\n
                              <p>Location: San Antonio, Texas</p>
                              \r\n\r\n
                              <p><br />\r\nResponsibilities-</p>
                              \r\n\r\n
                              <p>Customer is implementing a new POS solution and this program is all about&nbsp; doing the integration work for the new POS along with data migration and some new web app development.<br />\r\nAll the integration and web development work will be done using azure PaaS components.<br />\r\nResponsibilities are:<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp; Provide Inputs to enterprise solution Architecture<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Design secure integration solutions/Architecture<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Implement best practices when using azure components<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Work with 3rd party vendor architects on behalf of Customer to design integration solution<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Provide recommendation to optimize azure cost<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Recommendation and best practices on using various azure resources<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Hands on set up of azure components and design patterns for development teams to follow. Hands on to .Net Technologies</p>
                              \r\n\r\n
                              <p><br />\r\nResponsible for technical solutioning and design the integration Solution in AZURE. Design, develop, and construct detailed Azure architecture. Understand current state gaps and propose secured solutions to ensure roadmap can adapt to changes and integrate with existing environment or propose changes to existing environment. Work with vendors and customers to understand new solutions&rsquo; limitations and capabilities. Work with internal delivery teams to ensure solutions align with roadmap and architecture. Lead a team of engineers and developers to design and build solutions."</p>
                              \r\n\r\n
                              <p>Regards,</p>
                              \r\n\r\n
                              <p>Manish Kumar</p>
                              \r\n\r\n
                              <p><a href="http://http/" onclick="return Webmail.Widgets.Email.Message.evLinkClick(this);" rel="noopener noreferrer" target="_blank" title="This external link will open in a new window">Email-ID:[email protected]</a></p>
                              \r\n\r\n
                              <p>Desk NO:315-994-1244</p>
                              \r\n
                           </div>
                           \r\n\r\n
                           <div id="EditorSignatureID">&nbsp;</div>
                           \r\n                             
                        </td>
                        \r\n                        
                     </tr>
                     \r\n                        
                     <tr>
                        \r\n                            
                        <td style="font-family: calibri,sans-serif; font-size:14px; line-height:normal; color:#333; border:none">\r\n                                                           </td>
                        \r\n                        
                     </tr>
                     \r\n                    
                  </table>
                  \r\n                
               </td>
               \r\n            
            </tr>
            \r\n            \r\n                \t
         </table>
         \r\n        
         <p style="border:none; padding-left:10px; font-size:11px; font-family:Arial, Helvetica, sans-serif; color:#6b6c72; text-align:left; line-height:18px;text-transform: uppercase;"> To unsubscribe from future emails or to update your email preferences<a href="http://unsubscribe.idctechnologies.com/users/request_unsubscribe/217a2089eed1fd0f407ea853a29608b1cbaf9bb2/f40908d9c9fddff08cbeeb44f5678cbf48a9a840/YkgrQnRETjZscTQvT0taSDc5dzBFR0p0WXY5dmNQYjJRVDZaWnpac2Exdz0=/" style="color:#0077c5; text-decoration:underline"><b>click here </b></a>.</p>
      </div>
      \r\n<img width="1px" height="1px" alt="" src="http://clicks.mg.idctechnologies.com/o/eJwVzDsOwyAMANDTNCOyifkNLEj0GhXFJkEKRUp6f7XZ3vQ4BiL7xqVHDRrAaIOEZkWFKuVsvHM5pBSMz88HwdhU5_qVun_mMbcul6pzLHu07AkAC3CrWEKzIkTNIgmWlcEtp7RX52jd7XgKU50s_3IbpR_38gNSeihY">
   </body>
   \r\n
</html>
\r\n'

Solution

    from bs4 import BeautifulSoup
    data = """
    b'<!DOCTYPE HTML>\r\n
    <html>
       \r\n
       <head>
          \r\n
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
          \r\n
          <title>TalentHire - Simplified Recruiting and Staffing</title>
          \r\n
       </head>
       \r\n        \r\n
       <body leftmargin="0" rightmargin="0" topmargin="0" bottommargin="0">
          \r\n        
          <div style="width:100%; overflow:auto; float:left; margin: auto;">
             \r\n        
             <table cellpadding="0" cellspacing="0" border="0" style="width:100%; min-width:300px;">
                \r\n                        
                <tr>
                   \r\n                
                   <td style=" border:none;">
                      \r\n                \t
                      <table cellpadding="0" cellspacing="0" style="width:100%; min-width:280px; margin:0 auto; border:none;">
                         \r\n                        
                         <tr>
                            \r\n                            
                            <td style="font-family: calibri,sans-serif !important; font-size:15px !important; color:#333 !important; line-height:22px; border:none;">
                               \r\n                                
                               <div id="EditorSalutationID">
                                  \r\n
                                  <p>Position:&nbsp; Azure Architect</p>
                                  \r\n\r\n
                                  <p>Location: San Antonio, Texas</p>
                                  \r\n\r\n
                                  <p><br />\r\nResponsibilities-</p>
                                  \r\n\r\n
                                  <p>Customer is implementing a new POS solution and this program is all about&nbsp; doing the integration work for the new POS along with data migration and some new web app development.<br />\r\nAll the integration and web development work will be done using azure PaaS components.<br />\r\nResponsibilities are:<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp; Provide Inputs to enterprise solution Architecture<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Design secure integration solutions/Architecture<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Implement best practices when using azure components<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Work with 3rd party vendor architects on behalf of Customer to design integration solution<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Provide recommendation to optimize azure cost<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Recommendation and best practices on using various azure resources<br />\r\n&middot; &nbsp; &nbsp; &nbsp; &nbsp;Hands on set up of azure components and design patterns for development teams to follow. Hands on to .Net Technologies</p>
                                  \r\n\r\n
                                  <p><br />\r\nResponsible for technical solutioning and design the integration Solution in AZURE. Design, develop, and construct detailed Azure architecture. Understand current state gaps and propose secured solutions to ensure roadmap can adapt to changes and integrate with existing environment or propose changes to existing environment. Work with vendors and customers to understand new solutions&rsquo; limitations and capabilities. Work with internal delivery teams to ensure solutions align with roadmap and architecture. Lead a team of engineers and developers to design and build solutions."</p>
                                  \r\n\r\n
                                  <p>Regards,</p>
                                  \r\n\r\n
                                  <p>Manish Kumar</p>
                                  \r\n\r\n
                                  <p><a href="http://http/" onclick="return Webmail.Widgets.Email.Message.evLinkClick(this);" rel="noopener noreferrer" target="_blank" title="This external link will open in a new window">Email-ID:[email protected]</a></p>
                                  \r\n\r\n
                                  <p>Desk NO:315-994-1244</p>
                                  \r\n
                               </div>
                               \r\n\r\n
                               <div id="EditorSignatureID">&nbsp;</div>
                               \r\n                             
                            </td>
                            \r\n                        
                         </tr>
                         \r\n                        
                         <tr>
                            \r\n                            
                            <td style="font-family: calibri,sans-serif; font-size:14px; line-height:normal; color:#333; border:none">\r\n                                                           </td>
                            \r\n                        
                         </tr>
                         \r\n                    
                      </table>
                      \r\n                
                   </td>
                   \r\n            
                </tr>
                \r\n            \r\n                \t
             </table>
             \r\n        
             <p style="border:none; padding-left:10px; font-size:11px; font-family:Arial, Helvetica, sans-serif; color:#6b6c72; text-align:left; line-height:18px;text-transform: uppercase;"> To unsubscribe from future emails or to update your email preferences<a href="http://unsubscribe.idctechnologies.com/users/request_unsubscribe/217a2089eed1fd0f407ea853a29608b1cbaf9bb2/f40908d9c9fddff08cbeeb44f5678cbf48a9a840/YkgrQnRETjZscTQvT0taSDc5dzBFR0p0WXY5dmNQYjJRVDZaWnpac2Exdz0=/" style="color:#0077c5; text-decoration:underline"><b>click here </b></a>.</p>
          </div>
          \r\n<img width="1px" height="1px" alt="" src="http://clicks.mg.idctechnologies.com/o/eJwVzDsOwyAMANDTNCOyifkNLEj0GhXFJkEKRUp6f7XZ3vQ4BiL7xqVHDRrAaIOEZkWFKuVsvHM5pBSMz88HwdhU5_qVun_mMbcul6pzLHu07AkAC3CrWEKzIkTNIgmWlcEtp7RX52jd7XgKU50s_3IbpR_38gNSeihY">
       </body>
       \r\n
    </html>
    \r\n'
    """
    
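    # Parse the string with Python's built-in HTML parser backend
    soup = BeautifulSoup(data, 'html.parser')
    
    # .text returns every text node concatenated, with the tags removed
    print(soup.text)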
    

Output:

    b'
    
    
    
    TalentHire - Simplified Recruiting and Staffing
    
    
    
    
    
    
    
    
    
    
    Position:  Azure Architect
    Location: San Antonio, Texas
    
    Responsibilities-
    Customer is implementing a new POS solution and this program is all about  doing the integration work for the new POS along with data migration and some new web app development.
    All the integration and web development work will be done using azure PaaS components.
    Responsibilities are:
    ·         Provide Inputs to enterprise solution Architecture
    ·        Design secure integration solutions/Architecture
    ·        Implement best practices when using azure components
    ·        Work with 3rd party vendor architects on behalf of Customer to design integration solution
    ·        Provide recommendation to optimize azure cost
    ·        Recommendation and best practices on using various azure resources  
    ·        Hands on set up of azure components and design patterns for development teams to follow. Hands on to .Net Technologies
    
    Responsible for technical solutioning and design the integration Solution in 
    AZURE. Design, develop, and construct detailed Azure architecture. Understand current state gaps and propose secured solutions to ensure roadmap can adapt to changes and integrate with existing environment or propose changes to existing environment. Work with vendors and customers to understand new solutions’ limitations and capabilities. Work with internal delivery teams to ensure 
    solutions align with roadmap and architecture. Lead a team of engineers and developers to design and build solutions."
    Regards,
    Manish Kumar
    Email-ID:[email protected]
    Desk NO:315-994-1244
    
    
    
    
    
    
    
    
    
    
    
    
     To unsubscribe from future emails or to update your email preferencesclick here .
    
    
    
    
    
    '
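
The output above still begins with `b'` and ends with `'` because the printed representation of a bytes object was pasted into a string literal, and it keeps many blank lines. As a rough sketch of one way to handle both, assuming the input may arrive as real `bytes` encoded as UTF-8 (as the `<meta>` tag suggests), the following uses only the standard library's `html.parser` rather than Beautiful Soup. The names `TextExtractor` and `html_to_text` are hypothetical, chosen only for this illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; and friends are decoded
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(raw, encoding="utf-8"):
    """Return the text of an HTML str or bytes value, one text node per line."""
    if isinstance(raw, bytes):
        # Assumption: the bytes are UTF-8; undecodable bytes are replaced
        raw = raw.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    lines = []
    for chunk in parser.chunks:
        # Collapse whitespace runs (\r\n, tabs, non-breaking spaces) to one space
        cleaned = re.sub(r"\s+", " ", chunk).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


# A shortened stand-in for the HTML in the question, as a real bytes object
sample = (
    b"<html>\r\n<head><title>TalentHire</title></head>\r\n"
    b"<body><p>Position:&nbsp; Azure Architect</p>\r\n"
    b"<p>Location: San Antonio, Texas</p></body>\r\n</html>\r\n"
)
print(html_to_text(sample))
# TalentHire
# Position: Azure Architect
# Location: San Antonio, Texas
```

If you prefer to stay with Beautiful Soup, `soup.get_text(separator="\n", strip=True)` strips each text node and joins them with newlines, and `BeautifulSoup` accepts a bytes object directly, so there is no need to paste the `b'...'` representation into a string.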