HTML file parse section to csv


I am a newbie in Python. I am trying to get all the answers given by the executives (listed at the top) of a web page (https://www.dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0). The page is stored on my hard drive, so there is no URL to request.

So my end result would be:

Column 1  
All executives

Column 2  
all the answers

The answers should only be taken from the "question-and-answer-section".

What I tried was the following:

from bs4 import BeautifulSoup

with open('transcript-86-855.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
article_qanda = soup.find('div', id='article_qanda')

Could someone please help me?


Solution

  • If I understand you correctly, you want to print two columns: one with the name (in this case Dror Ben Asher) and the other with his answer.

    For example:

    import textwrap
    from bs4 import BeautifulSoup
    
    with open('page.html', 'r') as f_in:
        soup = BeautifulSoup(f_in.read(), 'html.parser')
    
    print('{:<30} {:<70}'.format('Name', 'Answer'))
    print('-' * 101)
    # ':-soup-contains' replaces the deprecated ':contains' (needs soupsieve 2.1+)
    for answer in soup.select('p:-soup-contains("Question-and-Answer Session") ~ strong:-soup-contains("Dror Ben Asher") + p'):
        txt = answer.get_text(strip=True)
    
        s = answer.find_next_sibling()
        while s:
            if s.name == 'strong' or s.find('strong'):
                break
            if s.name == 'p':
                txt += ' ' + s.get_text(strip=True)
            s = s.find_next_sibling()
    
        txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
    
        print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt))
        print()
    

    Prints:

    Name                           Answer                                                                
    -----------------------------------------------------------------------------------------------------
    Dror Ben Asher - CEO           Thank you, Scott. Its a very good question indeed in January we
                                   announced a new amendment and that amendment includes anti-TNF
                                   patients some of them not all of them, those who qualify. And we are
                                   talking about anti-TNF failures to be clear and only Remicade and
                                   Humira. The idea here was to increase very significantly the patients
                                   pooled of those potentially eligible for the study thus expediting
                                   recruitment. Did I answer your question?
    
    Dror Ben Asher - CEO           Right, this is one of most important tasks; right now the most
                                   important item here is the divestment of non-core assets. All other
                                   non-core assets, the non-core assets are those that are not within our
                                   therapeutic focus of GI and inflammation. And those are specifically
                                   RHB-103 RIZAPORT for migraine and RHB-101 which is a cardio drug.
                                   RHB-101 is a legacy drug, we have recently announced last month, we
                                   announced that we are in discussions for both of these product for
                                   out-licensing, which we hope to complete in the first half of 2015. So
                                   this is the highest priority, obviously discussion on other product,
                                   but Redhill is in the fortunate position that we are able to complete
                                   our Phase III studies with our existing results, resources and as time
                                   goes by obviously the value of the assets keeps going up. So we are in
                                   no rush to out-license everything else and so there is obviously in
                                   track.
    
    ...and so on.
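
The loop above only prints the answers, while the question asks for a CSV file. A minimal sketch of that last step, assuming you collect each (name, answer) pair in a list instead of printing it. The function name `write_answers_csv`, the column headers and the output file name `answers.csv` are made up for illustration and are not part of the original answer:

```python
import csv


def write_answers_csv(rows, path):
    """Write (executive, answer) pairs to a two-column CSV file.

    `rows` is any iterable of 2-tuples; the header names are an assumption.
    """
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(['Executive', 'Answer'])
        writer.writerows(rows)


if __name__ == '__main__':
    # Hypothetical row: in the loop above you would append
    # ('Dror Ben Asher - CEO', txt) before txt is wrapped for printing.
    rows = [
        ('Dror Ben Asher - CEO', 'Thank you, Scott. Did I answer your question?'),
    ]
    write_answers_csv(rows, 'answers.csv')
```

`csv.writer` quotes fields containing commas or quotation marks, so the answer text can be stored unchanged. Append the unwrapped `txt`, since the `textwrap` line breaks are only useful for console output.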