
Scraping and parsing citation info from Google Scholar search results


I have a list of around 20,000 article titles and I want to scrape their citation counts from Google Scholar. I am new to the BeautifulSoup library. I have this code:

import requests
from bs4 import BeautifulSoup

query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']

url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

But it returns only the title and URL. I don't know how to get the citation information from another tag. Please help me out here.


Solution

  • You need to loop over the list, because a list cannot be concatenated into the URL string directly. You can use a Session so the connection is reused across requests. The code below is for bs4 4.7.1, which supports the :contains pseudo-class for finding the citation count. It looks like you can drop the h3 type selector from the CSS selector and use just the class before the a, i.e. .gs_rt a. If you don't have 4.7.1, you can use [title=Cite] + a to select the citation count instead.

    import requests
    from bs4 import BeautifulSoup as bs
    
    queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
             'Uncoupling conformational states from activity in an allosteric enzyme',
             'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
             'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
             'Primary Prevention of CVD',
             'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
             'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
             'We Know Who Likes Us, but Not Who Competes Against Us']
    
    with requests.Session() as s:
        for query in queries:
            url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
            r = s.get(url)
            soup = bs(r.content, 'lxml') # or 'html.parser'
            title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
            link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
            citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
            print(title, link, citations) 
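
The loop above concatenates the raw title straight into the URL. As a minimal sketch, assuming only the standard library, the query string could be percent-encoded explicitly with urllib.parse.urlencode. The helper name build_scholar_url and the constant BASE_URL are hypothetical and not part of the original answer; the parameter values are copied from the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://scholar.google.com/scholar'


def build_scholar_url(title):
    """Return a search URL with the title percent-encoded (hypothetical helper)."""
    # Same parameters as the hand-built URL in the answer.
    params = {'q': title, 'ie': 'UTF-8', 'oe': 'UTF-8', 'hl': 'en', 'btnG': 'Search'}
    return BASE_URL + '?' + urlencode(params)


url = build_scholar_url('Primary Prevention of CVD')
print(url)
```

requests can also do this encoding itself if the same dictionary is passed as the params argument of s.get(...).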
    

    The alternative for versions earlier than 4.7.1:

    with requests.Session() as s:
        for query in queries:
            url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
            r = s.get(url)
            soup = bs(r.content, 'lxml') # or 'html.parser'
            title = soup.select_one('.gs_rt a')
            if title is None:
                title = 'No title'
                link = 'No link'
            else:  
                link = title['href']
                title = title.text
            citations = soup.select_one('[title=Cite] + a')
            if citations is None:
                citations = 'No citation count'
            else:
                citations = citations.text
            print(title, link, citations)
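
Both versions print the link text rather than a number. If an integer count is wanted, a small sketch like the following could pull it out of the text. It assumes the link text has the form "Cited by N", as suggested by the selector used in the first version; the function name parse_citation_count is hypothetical.

```python
import re


def parse_citation_count(link_text):
    """Return the integer from text like 'Cited by 123', or None if absent."""
    # Assumption: the count appears as plain digits right after 'Cited by '.
    match = re.search(r'Cited by (\d+)', link_text)
    return int(match.group(1)) if match else None


print(parse_citation_count('Cited by 123'))       # 123
print(parse_citation_count('No citation count'))  # None
```

Returning None for a missing count keeps "no count found" distinct from a real count of 0.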
    

    The bottom version was rewritten thanks to comments from @facelessuser; the top version is left for comparison. The comments were:

    It would probably be more efficient not to call select_one twice in a single-line if statement. While the pattern building is cached, the returned tag is not cached. I personally would set the variable to whatever is returned by select_one and then, only if the variable is None, change it to No link or No title etc. It isn't as compact, but it will be more efficient.

    [...] always check with if tag is None: and not just if tag:. With selectors, it isn't a big deal as they will only return tags, but if you ever do something like for x in tag.descendants: you get text nodes (strings) and tags, and an empty string will evaluate false even though it is a valid node. In that case, it is safest to check for None.
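
The two comments above can be illustrated without any scraping. This is only a sketch: plain str values stand in for text nodes so that bs4 is not needed, FakeTag is a hypothetical stand-in for a tag with a .text attribute, and text_or_default is a hypothetical helper name.

```python
def text_or_default(node, default):
    """Return node.text, falling back to default only when node is None."""
    # The lookup result is stored once by the caller and tested with 'is None'.
    return default if node is None else node.text


class FakeTag:
    """Hypothetical stand-in for a tag; only the .text attribute is modelled."""
    def __init__(self, text):
        self.text = text


# Stand-ins for values a lookup or iteration might produce.
nodes = ['Cited by 5', '', None]

kept_by_truthiness = [n for n in nodes if n]              # drops '' and None
kept_by_none_check = [n for n in nodes if n is not None]  # drops only None

print(kept_by_truthiness)   # ['Cited by 5']
print(kept_by_none_check)   # ['Cited by 5', '']
print(text_or_default(FakeTag('Cited by 5'), 'No citation count'))
print(text_or_default(None, 'No citation count'))
```

The truthiness test silently discards the empty string, which is the case the second comment warns about.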