
How to read a text string into the urllib data parameter?


I am following these guidelines (although they are written for Python 2) to perform a search here, and the query I need is:

queryText = """
<?xml version="1.0" encoding="UTF-8"?>
<orgPdbQuery>
<queryType>org.pdb.query.simple.TreeEntityQuery</queryType>
<description>TaxonomyTree Search for OTHER SEQUENCES</description>
<t>1</t>
<n>694009</n>
<nodeDesc>OTHER SEQUENCES</nodeDesc>
</orgPdbQuery>
"""

I know this query is right, because when I enter it into the second link, 'Sample XML Queries' (selecting 'Source Organism Browser (NCBI)'), I get an output (this is just the start of it):

383 results

1Q2W:1 1QZ8:1 1SSK:1 1UJ1:1 1UK2:1 1UK3:1 1UK4:1 1UW7:1 1WNC:1 1WOF:1 1WYY:1 1XAK:1 1YO4:1 1YSY:1 1Z1I:1 1Z1J:1 1ZV7:1 1ZV8:1 1ZV8:2 1ZVA:1 1ZVB:1 2A5A:1 2A5I:1 2A5K:1 2ACF:1 2AHM:1 2AHM:2 2AJF:2 2ALV:1 2AMD:1 2AMQ:1 2BEQ:1 2BEQ:2 2BEZ:1 2BEZ:2 2BX3:1 2BX4:1 2C3S:1 2CJR:1 2CME:1 2CME:2 2CME:3 2CME:4 2D2D:1 2DD8:3 2DUC:1 2FAV:1 2FE8:1 2FXP:1 2FYG:1 2G9T:1 2GA6:1 2GDT:1 2GHV:1 2GHW:1 2GIB:1 2GRI:1 2GT7:1 2GT8:1 2GTB:1 2GX4:1 2GZ7:1 2GZ8:1 2GZ9:1 2H2Z:1 2H85:1 2HOB:1 2HSX:1 2IDY:1 2JW8:1 2JZD:1 2JZE:1 2JZF:1 2K7X:1 2K87:1 2KAF:1 2KQV:1 2KQW:1 2KYS:1 2LIZ:1 2MM4:1 2OFZ:1 2OG3:1 2OP9:1 2OZK:1 2PWX:1 2Q6G:1 2QC2:1 2

I now want to replicate this search in Python, so I wrote this:

import urllib
import urllib.parse
import urllib.request

url = 'http://www.rcsb.org/pdb/rest/search'


queryText = """
<?xml version="1.0" encoding="UTF-8"?>
<orgPdbQuery>
<queryType>org.pdb.query.simple.TreeEntityQuery</queryType>
<description>TaxonomyTree Search for OTHER SEQUENCES</description>
<t>1</t>
<n>694009</n>
<nodeDesc>OTHER SEQUENCES</nodeDesc>
</orgPdbQuery>
"""

encoded_data = urllib.parse.urlencode(queryText).encode('utf-8')
req = urllib.request.Request(url)
with urllib.request.urlopen(req,data=encoded_data) as f:
        resp = f.read()
        print(resp)

I get the error:

Traceback (most recent call last):
  File "/Users/slowat/anaconda/envs/py3/lib/python3.6/urllib/parse.py", line 892, in urlencode
    raise TypeError
TypeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "generate_pdbs_from_rcsb.py", line 19, in <module>
    encoded_data = urllib.parse.urlencode(queryText).encode('utf-8')
  File "/Users/slowat/anaconda/envs/py3/lib/python3.6/urllib/parse.py", line 900, in urlencode
    "or mapping object").with_traceback(tb)
  File "/Users/slowat/anaconda/envs/py3/lib/python3.6/urllib/parse.py", line 892, in urlencode
    raise TypeError
TypeError: not a valid non-string sequence or mapping obj

Could someone demonstrate how to get this code to work?

Update 1: I also tried:

url = 'http://www.rcsb.org/pdb/rest/search'
d = dict(queryType='org.pdb.query.simple.TreeEntityQuery',n='694009')
f = urllib.parse.urlencode(d)
f = f.encode('utf-8')
req = urllib.request.Request(url,f)
with urllib.request.urlopen(req) as f:
       resp = f.read()
       print(resp)

which has the output:

'Problem creating Query from XML: Content is not allowed in prolog.\nqueryType=org.pdb.query.simple.TreeEntityQuery&n=694009\n'

Solution

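  • The urlencode function expects a mapping (or a sequence of two-element tuples) of key/value pairs, which is why passing it a plain string raises the TypeError. There is no need to use this function here, since you're submitting XML directly to the service rather than form fields. That is also why the attempt in Update 1 fails: the service receives form-encoded text (queryType=...&n=...) where it expects an XML document. In Python 3 the data parameter must be bytes, so write queryText as a bytes literal instead of a string (the b before the opening """ marks it as a byte sequence and not as a plain str):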

    import urllib.request
    
    url = 'http://www.rcsb.org/pdb/rest/search'
    
    # Bytes literal: urlopen() requires the request body to be bytes, not str.
    queryText = b"""
    <?xml version="1.0" encoding="UTF-8"?>
    <orgPdbQuery>
    <queryType>org.pdb.query.simple.TreeEntityQuery</queryType>
    <description>TaxonomyTree Search for OTHER SEQUENCES</description>
    <t>1</t>
    <n>694009</n>
    <nodeDesc>OTHER SEQUENCES</nodeDesc>
    </orgPdbQuery>
    """
    
    # Supplying data makes urlopen() send a POST with the XML as the body.
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req, data=queryText) as f:
        resp = f.read()  # raw response body, as bytes
        print(resp)
    

    This gives the result you expect back in resp. Note that resp is a bytes object; call resp.decode('utf-8') if you need a str.
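
If the query is built as an ordinary str (for example, because values are formatted into it), an alternative to the bytes literal is to encode it just before sending. The following is a minimal sketch under stated assumptions, not a definitive implementation: `build_request`, `parse_ids` and `run_search` are hypothetical helper names, the URL is the one from the question and is not checked for availability, and the response is assumed to be whitespace-separated IDs as in the sample output. The sketch starts the string directly at the XML declaration, a cautious choice since the XML specification only allows the declaration at the very start of a document. It also shows what urlencode produces, for comparison with the text echoed back in Update 1.

```python
import urllib.parse
import urllib.request

# Endpoint copied from the question; whether it is still reachable is not checked here.
URL = 'http://www.rcsb.org/pdb/rest/search'

# Plain str query; it is converted to bytes only when the request is built.
QUERY_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<orgPdbQuery>
<queryType>org.pdb.query.simple.TreeEntityQuery</queryType>
<description>TaxonomyTree Search for OTHER SEQUENCES</description>
<t>1</t>
<n>694009</n>
<nodeDesc>OTHER SEQUENCES</nodeDesc>
</orgPdbQuery>
"""

# urlencode() turns key/value pairs into a form-style string, not XML.
form_body = urllib.parse.urlencode(
    {'queryType': 'org.pdb.query.simple.TreeEntityQuery', 'n': '694009'}
)


def build_request(url, xml_text):
    """Hypothetical helper: wrap an XML string in a POST request."""
    body = xml_text.encode('utf-8')  # str -> bytes, as urlopen() requires
    return urllib.request.Request(url, data=body)


def parse_ids(resp):
    """Hypothetical helper: split a bytes response into whitespace-separated IDs."""
    return resp.decode('utf-8').split()


def run_search(url=URL, xml_text=QUERY_TEXT):
    """Send the query and return the parsed IDs (performs a network call)."""
    with urllib.request.urlopen(build_request(url, xml_text)) as f:
        return parse_ids(f.read())
```

Calling `run_search()` performs the actual request; `build_request` and `parse_ids` can be exercised on their own without touching the network, which makes it easy to confirm that the body is bytes and that the request will be sent as a POST.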