
Web Scraping a Dynamic Website to Pull Recent News Article URLs


I am attempting to pull investing news articles from a dynamic website using Python. I have tried a couple of tutorials that worked for static websites, but I have had issues pulling the URL to a specific article. The code I am working with is as follows:

    from requests_html import HTMLSession
    session = HTMLSession()
    
    r = session.get('https://www.institutionalinvestor.com/search?'
    'term=&' # eventually, the term would include the words I am actively searching for
    'filters=%7B"dates":%5B"last%20week"%5D%7D') # filter to the last week, this would eventually be for the last 24 hours only

    r.html.absolute_links

This returns a set of the absolute links found on the page:

{'https://www.institutionalinvestor.com/Login', 'https://www.institutionalinvestor.com/display-advertising', 'http://www.ttivanguard.com/', 'https://www.riaintel.com/', 'http://interactive.institutionalinvestor.com/executive-IR-research-em/about-586KX-2742AB.html', 'https://twitter.com/iimag', 'https://myaccount.institutionalinvestor.com/Orders/SelectPackage.html', 'https://www.institutionalinvestor.com/', 'https://www.institutionalinvestor.com/Corner-Office', 'https://www.institutionalinvestor.com/Management', 'http://iimemberships.com/', 'http://www.iiconferences.com/', 'https://www.institutionalinvestor.com/Register', 'https://www.institutionalinvestor.com/cookies', 'https://www.institutionalinvestor.com/Careers', 'https://www.institutionalinvestor.com/Custom-Research', 'https://www.institutionalinvestor.com/Portfolio', 'https://www.euromoneyplc.com/modern-slavery-act-transparency-statement', 'https://www.institutionalinvestor.com/research', 'https://www.institutionalinvestor.com/Masthead', 'https://www.institutionalinvestor.com/about-thought-leadership', 'https://www.institutionalinvestor.com/Investors', 'https://www.institutionalinvestor.com/Premium', 'https://www.institutionalinvestor.com/about-us', 'https://www.institutionalinvestor.com/thought-leadership', 'https://www.institutionalinvestor.com/PrivacyPolicy', 'https://www.institutionalinvestor.com/sponsored', 'https://www.institutionalinvestor.com/Video', 'https://www.institutionalinvestor.com/How-to-Pitch-Institutional-Investor', 'https://www.institutionalinvestor.com/FAQs', 'https://www.institutionalinvestor.com/Research-FAQs', 'https://www.institutionalinvestor.com/Reprints', 'https://www.institutionalinvestor.com/TermsConditions', 'https://www.linkedin.com/company/164389', 'https://www.facebook.com/iimag', 'https://www.institutionalinvestor.com/Customer-Service', 'https://www.institutionalinvestor.com/Culture', 'https://www.institutionalinvestor.com/awards', 
'https://www.institutionalinvestor.com/Research-Insight', 'http://www.sovereignwealthcenter.com/'}

But I cannot find the links to the articles themselves. When I inspect the source code, this is what I see:

    <div class="search-results" role="listbox">
        <article class="search-result" ng-repeat="article in serverData.hits.results">
            <div class="search-result-text-ghost"></div>
            <h2 ng-class="article|publicationClass"><a ng-href="{{article|articleHref}}">{{article|snippet:'title'|removeHtmlTags}}</a>
            </h2>

As someone relatively new to HTML, the h2 element near the end leads me to believe that the site is dynamic, which is where I am stuck. Any help would be appreciated. My ideal output is a dataframe containing the title of each article, the source (in this case "Institutional Investor"), a preview of the article (the first couple of lines or so), and the URL of the article. The dataframe would be sent to me each morning to save the time I would otherwise spend manually pulling news. I have put together the rest of the project; the only missing piece is the news pull for sites such as Institutional Investor that are not covered by the API I am using.

I am open to any and all new methods, if necessary or recommended. Thank you in advance!


Solution

  • Yes, it is dynamic. You could use Selenium to let the page render first, then parse the HTML as you normally would for a static site. Alternatively, the data is all available from their API (I think even the full article is in there, but I only pulled out what you asked for):

    import requests
    import pandas as pd
    
    # JSON search endpoint queried instead of the rendered page
    api = 'https://search.euromoneyapi.com/api/Search'
    
    headers = {'content-type': 'application/json',
               'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
    
    # Request body: "from" and "size" select the first 10 hits, sorted by date, descending
    payload = {"site": "amg_ii",
               "suggester": True,
               "from": 0,
               "size": 10,
               "sort": "dates",
               "sort_order": "desc"}
    
    # json= serializes the payload into a JSON request body
    jsonData = requests.post(api, headers=headers, json=payload).json()
    
    rows = []
    articles = jsonData['hits']['results']
    for article in articles:
        title = article['snippet']['title'][0]
        source = 'https://www.institutionalinvestor.com/'
        try:
            preview = article['snippet']['description'][0]
        except (KeyError, IndexError, TypeError):
            # Some articles come back without a description
            preview = ''
        url = 'https://www.institutionalinvestor.com/article/' + article['id'].split('/')[-1] + '/' + article['fields']['url_title'][0]
       
        row = {'title':title,
               'source':source,
               'preview':preview,
               'url':url}
        rows.append(row)
        
    df = pd.DataFrame(rows)
    

    Output:

    print(df.to_string())
                                                                           title                                  source                                                                                                                                                                        preview                                                                                                                                     url
    0                                                            Who’s on Third?  https://www.institutionalinvestor.com/                                                                  Third-party claims filing service providers require due diligence for shareholder litigation outside the U.S                                                              https://www.institutionalinvestor.com/article/b1pqxvgpm3dwjb/Who-s-on-Third
    1                      First the Cyberattack Hits. Then the Insider Trading.  https://www.institutionalinvestor.com/                                                                                         Researchers share their striking evidence of pre-disclosure spikes in options trading.                        https://www.institutionalinvestor.com/article/b1pzfhkhcv70m1/First-the-Cyberattack-Hits-Then-the-Insider-Trading
    2                         Hedge Funds Featured Prominently in 2020 SPAC Boom  https://www.institutionalinvestor.com/  Nearly 13 percent of the blank check companies that filed plans to go public in 2020 were sponsored by hedge fund firms or individuals formerly associated with the industry.                         https://www.institutionalinvestor.com/article/b1pzg04d0bbvxz/Hedge-Funds-Featured-Prominently-in-2020-SPAC-Boom
    3                            The Stocks That Drove Glenview’s Major Comeback  https://www.institutionalinvestor.com/                                                             Larry Robbins’ hedge fund finished 2020 solidly positive thanks to huge gains in the final two months of the year.                            https://www.institutionalinvestor.com/article/b1pzf7qb428t3x/The-Stocks-That-Drove-Glenview-s-Major-Comeback
    4                                         Bill Ackman’s Billion-Dollar Year  https://www.institutionalinvestor.com/                                                                                                     A big short and a big SPAC fueled hefty gains for Pershing Square in 2020.                                          https://www.institutionalinvestor.com/article/b1pzgx69sxhstk/Bill-Ackman-s-Billion-Dollar-Year
    5                              Ex-Verger Interns Make NFL, ‘Bachelor’ Debuts  https://www.institutionalinvestor.com/                                                                  Verger Capital Management CIO Jim Dunn shared the inside story on former interns John Wolford and Matt James.                                 https://www.institutionalinvestor.com/article/b1pzg3qjq9xt5x/Ex-Verger-Interns-Make-NFL-Bachelor-Debuts
    6  David Einhorn’s Greenlight Capital Pulls Off a Coup in the Fourth Quarter  https://www.institutionalinvestor.com/                                                                                          The manager turned in a strong fourth quarter by sticking with his biggest positions.  https://www.institutionalinvestor.com/article/b1pyl5mtkmpt80/David-Einhorn-s-Greenlight-Capital-Pulls-Off-a-Coup-in-the-Fourth-Quarter
    7                                             Gold&#39;s 2020 Ride Explained  https://www.institutionalinvestor.com/                                                                                                                                                                                                                               https://www.institutionalinvestor.com/article/b1psmn58mppsyj/gold39s-2020-ride-explained
    8                                     The ARK Invest Takeover Battle Is Over  https://www.institutionalinvestor.com/                                                                                A new deal has “extinguished” Resolute’s option to acquire an additional stake in the ETF firm.                                     https://www.institutionalinvestor.com/article/b1pw88ldyr905m/The-ARK-Invest-Takeover-Battle-Is-Over
    9                           Investors Quickly Saw Big Gains From These SPACs  https://www.institutionalinvestor.com/                                                                                                      At least two blank-check companies surged on recent merger announcements.                           https://www.institutionalinvestor.com/article/b1pt6fl7c9dsqc/Investors-Quickly-Saw-Big-Gains-From-These-SPACs
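
One thing the output shows is that titles can come back with HTML entities still encoded (row 7 reads `Gold&#39;s`). Here is a minimal sketch of a cleanup step using the standard library's `html.unescape`, assuming the same response shape as in the code above. The helper name `results_to_rows` and the sample dictionary are made up for illustration, and the `id` prefix in the sample is a placeholder, not a real value. It also stores the publication name as the source, as the question asked, rather than the site URL.

```python
import html

BASE_URL = "https://www.institutionalinvestor.com/article/"


def results_to_rows(json_data):
    """Flatten the parsed search response into row dicts, decoding HTML entities."""
    rows = []
    for article in json_data["hits"]["results"]:
        snippet = article.get("snippet", {})
        # Title and description arrive as one-element lists; the description may be absent.
        title = html.unescape(snippet["title"][0])
        description = snippet.get("description") or [""]
        preview = html.unescape(description[0])
        # The article URL is built from the last segment of the id plus the URL title.
        article_id = article["id"].split("/")[-1]
        url = BASE_URL + article_id + "/" + article["fields"]["url_title"][0]
        rows.append({"title": title,
                     "source": "Institutional Investor",
                     "preview": preview,
                     "url": url})
    return rows


# Hypothetical response fragment for illustration only; the id prefix is a placeholder.
sample = {"hits": {"results": [{
    "id": "example/b1psmn58mppsyj",
    "snippet": {"title": ["Gold&#39;s 2020 Ride Explained"]},
    "fields": {"url_title": ["gold39s-2020-ride-explained"]},
}]}}

rows = results_to_rows(sample)
print(rows[0]["title"])  # Gold's 2020 Ride Explained
```

With the real response, `pd.DataFrame(results_to_rows(jsonData))` should give the same columns as before, with the entities decoded.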