Tags: html, web-scraping, beautifulsoup, urllib, pdb

Cannot scrape <div id="search-container"> from pdb databank website


I am new to web scraping and am trying to extract each PDB ID from an RCSB PDB search results page such as:

url= https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_entity_source_organism.rcsb_gene_name.value%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22MCF3%22%7D%2C%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22c75bbf18a058a812f5384f297528d4b6%22%7D%7D

Here, I am trying to get IDs like "3ZBF" and "4UXL".

I wrote the code below:

import requests
from bs4 import BeautifulSoup

url='https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_entity_source_organism.rcsb_gene_name.value%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22MCF3%22%7D%2C%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22c75bbf18a058a812f5384f297528d4b6%22%7D%7D'

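# Download the search results page and parse its HTML
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")

# Locate the main content container, then the search container inside it
tex_tag = soup.find('div', {"class": "container", "id": "maincontentcontainer"})
new_line = tex_tag.find("div", {"id": "search-container"})

print(new_line)
print(tex_tag.prettify())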

With this code, I cannot see the contents of <div id="search-container">. I inspected the page's HTML in the browser, and the PDB IDs are inside <div id="search-container">.

Can you suggest a solution or give me some insight into how I can solve this problem?

Thank you in advance.


Solution

  • This site uses an API to fetch the results before rendering them, so the IDs are not in the HTML that requests.get downloads. The data comes from this URL:

    POST https://www.rcsb.org/search/gql
    

    with the list of identifiers passed in the JSON body:

    import requests

    # PDB IDs to look up
    ids = ["3ZBF", "4UXL"]

    # POST the identifiers as a JSON body to the endpoint the page itself calls
    r = requests.post(
        "https://www.rcsb.org/search/gql",
        json={
            "attributes": None,
            "identifiers": ids,
            "returnType": "entry",
            "report": "search_summary",
        },
    )

    print(r.json())
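
As a side note on the search URL from the question: its `request` parameter is URL-encoded JSON, so decoding it shows exactly which query the page sends to its API. The sketch below uses only the standard library; the helper name `decode_search_request` is made up for illustration and is not part of any RCSB tooling.

```python
import json
from urllib.parse import parse_qs, urlsplit

# The search URL from the question, split across lines for readability.
url = (
    "https://www.rcsb.org/search?request="
    "%7B%22query%22%3A%7B%22parameters%22%3A%7B%22attribute%22%3A"
    "%22rcsb_entity_source_organism.rcsb_gene_name.value%22%2C"
    "%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22MCF3%22%7D%2C"
    "%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22node_id%22%3A0%7D%2C"
    "%22return_type%22%3A%22entry%22%2C"
    "%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C"
    "%22scoring_strategy%22%3A%22combined%22%2C"
    "%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C"
    "%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C"
    "%22query_id%22%3A%22c75bbf18a058a812f5384f297528d4b6%22%7D%7D"
)


def decode_search_request(search_url):
    """Return the JSON query embedded in the URL's `request` parameter.

    Hypothetical helper: it only decodes the URL and makes no network call.
    """
    query_string = urlsplit(search_url).query
    # parse_qs percent-decodes the values and returns a list per key
    params = parse_qs(query_string)
    return json.loads(params["request"][0])


search_request = decode_search_request(url)

# Pretty-print the decoded query to inspect its structure
print(json.dumps(search_request, indent=2))

# The gene name being searched for and the page size requested
print(search_request["query"]["parameters"]["value"])      # MCF3
print(search_request["request_options"]["pager"]["rows"])  # 100
```

Reading the decoded query this way makes it easier to match the URL against the requests shown in the browser's network tab, which is where the endpoint above can be observed.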