python, python-3.x, web-scraping, beautifulsoup

Web scraping table from UniProt database


I have a list of UniProt IDs and would like to use BeautifulSoup to scrape a table containing the structure information. The URL I am using is as follows: https://www.uniprot.org/uniprot/P03496, with accession "P03496".

A snippet of the HTML is as follows.

<div class="main-aside">
    <div class="content entry_view_content up_entry swissprot">
        <div class="section" id="structure">
            <protvista-uniprot-structure accession="P03468">
                <div class="protvista-uniprot-structure">
                    <div class="protvista-uniprot-structure__table">
                        <protvista-datatable class="feature">
                            <table>...</table>
                        </protvista-datatable>
                    </div>
                </div>
            </protvista-uniprot-structure>
        </div>
    </div>
</div>

The information I need is inside the <table>...</table> element.

I tried

from bs4 import BeautifulSoup
import requests

url = 'https://www.uniprot.org/uniprot/P03468'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# look for the custom element that wraps the structure table
table = soup.find("protvista-datatable", {"class": "feature"})
print(table)

Solution

  • The content is loaded dynamically, so it is not contained in your soup, as you will see if you take a closer look at the response. You do not need BeautifulSoup to get the data your table is based on; simply use their API / REST interface to get structured data as JSON:

    import requests

    url = 'https://rest.uniprot.org/uniprot/P03468'
    # fetch the JSON response
    data = requests.get(url).json()
    # pick the needed data, e.g. the cross-references
    data['uniProtKBCrossReferences']
    

    Output

    [{'database': 'EMBL',
      'id': 'J02146',
      'properties': [{'key': 'ProteinId', 'value': 'AAA43412.1'},
       {'key': 'Status', 'value': '-'},
       {'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
     {'database': 'EMBL',
      'id': 'AF389120',
      'properties': [{'key': 'ProteinId', 'value': 'AAM75160.1'},
       {'key': 'Status', 'value': '-'},
       {'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
     {'database': 'EMBL',
      'id': 'EF467823',
      'properties': [{'key': 'ProteinId', 'value': 'ABO21711.1'},
       {'key': 'Status', 'value': '-'},
       {'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
     {'database': 'EMBL',
      'id': 'CY009446',
      'properties': [{'key': 'ProteinId', 'value': 'ABD77678.1'},
       {'key': 'Status', 'value': '-'},
       {'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
     {'database': 'EMBL',
      'id': 'K01031',
      'properties': [{'key': 'ProteinId', 'value': 'AAA43415.1'},
       {'key': 'Status', 'value': '-'},
       {'key': 'MoleculeType', 'value': 'Genomic_RNA'}]},
     {'database': 'RefSeq',
      'id': 'NP_040981.1',
      'properties': [{'key': 'NucleotideSequenceId', 'value': 'NC_002018.1'}]},
     {'database': 'PDB',
      'id': '6WZY',
      'properties': [{'key': 'Method', 'value': 'X-ray'},
       {'key': 'Resolution', 'value': '1.50 A'},
       {'key': 'Chains', 'value': 'C=181-190'}]},...]