python-3.x  xml-parsing  astropy  urlopen

Parsing an XML file from a URL into an astropy VOTable without downloading it


From http://svo2.cab.inta-csic.es/theory/fps/ you can get the transmission curves for many filters used in astronomical observations. I would like to get these data by opening the URL of the corresponding XML file (one per filter) and parsing it into an astropy VOTable, which makes the table data easy to read.

I have managed to do this by opening the URL, decoding the response as UTF-8, and saving it locally as an XML file. Opening the local file then works fine, as the following example shows.

However, I do not want to save the file and open it again. When I tried to skip that step with votable = parse(xml_file), it raised OSError: File name too long, because parse() treats the whole string as a file name.

from urllib.request import urlopen
from astropy.io.votable import parse

fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
url = urlopen(fltr).read()
xml_file = url.decode('UTF-8')
with open('tmp.xml','w') as out:
    out.write(xml_file)

votable = parse('tmp.xml')
data = votable.get_first_table().to_table(use_names_over_ids=True)

print(votable)
print(data["Wavelength"])

The output in this case is:

<VOTABLE>... 1 tables ...</VOTABLE>
Wavelength
AA    
----------
12890.0
13150.0
...
18930.0
19140.0
Length = 58 rows

Solution

  • Indeed, according to the API documentation, the first argument of votable.parse is either a filename or a readable file-like object. The documentation doesn't say so explicitly, but apparently the file also has to be seekable, meaning that it can be read with random access.

    The HTTPResponse object returned by urlopen is indeed a file-like object with a .read() method, so in principle it might be possible to pass it directly to parse(), but this is how I found out it has to be seekable:

    >>> fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
    >>> u = urlopen(fltr)
    >>> parse(u)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "astropy/io/votable/table.py", line 135, in parse
        _debug_python_based_parser=_debug_python_based_parser) as iterator:
      File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
        return next(self.gen)
      File "astropy/utils/xml/iterparser.py", line 157, in get_xml_iterator
        with _convert_to_fd_or_read_function(source) as fd:
      File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
        return next(self.gen)
      File "astropy/utils/xml/iterparser.py", line 63, in _convert_to_fd_or_read_function
        with data.get_readable_fileobj(fd, encoding='binary') as new_fd:
      File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
        return next(self.gen)
      File "astropy/utils/data.py", line 210, in get_readable_fileobj
        fileobj.seek(0)
    io.UnsupportedOperation: seek
    

    So you need to wrap the data in a seekable file-like object. Along the lines that @keflavich wrote you can use io.BytesIO (io.StringIO won't work as explained below).

    It turns out that there's no reason to explicitly decode the UTF-8 data to unicode. I'll spare you the example, but when I tried it myself, parse() worked on raw bytes (which I find a bit odd, but okay). So you can read the entire contents of the URL into an io.BytesIO, which is just an in-memory file-like object that supports random access:

    >>> import io
    >>> u = urlopen(fltr)
    >>> s = io.BytesIO(u.read())
    >>> v = parse(s)
    WARNING: W42: None:2:0: W42: No XML namespace specified [astropy.io.votable.tree]
    >>> v.get_first_table().to_table(use_names_over_ids=True)
    <Table masked=True length=58>
    Wavelength Transmission
        AA
     float32     float32
    ---------- ------------
       12890.0          0.0
       13150.0          0.0
           ...          ...
       18930.0          0.0
       19140.0          0.0
    

    This is, in general, the way in Python to do something with some data as though it were a file, without writing an actual file to the filesystem.
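
The same idea can be wrapped in a small helper. The sketch below is only illustrative: `to_seekable` and `ForwardOnly` are hypothetical names made up for this example, and `ForwardOnly` merely imitates a network response (readable but not seekable) so the snippet runs without astropy or a network connection. It also assumes the stream exposes a `.seekable()` method, as objects derived from the `io` base classes do.

```python
import io


def to_seekable(stream):
    """Return a seekable binary file-like object with the stream's contents.

    Hypothetical helper; not part of astropy or the standard library.
    """
    # Streams that already support random access can be used as they are.
    if stream.seekable():
        return stream
    # Otherwise read everything into memory; io.BytesIO supports seek().
    return io.BytesIO(stream.read())


class ForwardOnly(io.RawIOBase):
    """Stand-in for a network response: readable, but not seekable."""

    def __init__(self, payload):
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # Delegate to the in-memory buffer; returns the number of bytes read.
        return self._inner.readinto(buffer)


raw = ForwardOnly(b"<VOTABLE></VOTABLE>")
buf = to_seekable(raw)
buf.seek(0)          # works now; raw.seek(0) would raise UnsupportedOperation
print(buf.read())
```

With astropy installed, something like `parse(to_seekable(urlopen(fltr)))` should then behave like the `io.BytesIO` example above, though that particular call was not tested here.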

    Note, however, that this won't work if the entire file can't fit in memory. In that case you might still need to write it out to disk. But if it's just for some temporary processing and you don't want to litter your disk with a tmp.xml like the one in your example, you can always use the tempfile module to, among other things, create temporary files that are automatically deleted once they're no longer in use.
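
A minimal sketch of the tempfile route follows. The helper name `parse_via_tempfile`, the file name `filter.xml`, and the sample payload are all assumptions made for illustration; `xml.etree.ElementTree.parse` stands in for astropy's `parse` so the snippet runs on its own. It uses `tempfile.TemporaryDirectory` rather than a named temporary file, so the file can be reopened by name on any platform.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_via_tempfile(payload, parser):
    """Write `payload` (bytes) to a throwaway file and run `parser` on its path.

    Hypothetical helper; `parser` is any callable that accepts a filename.
    """
    # The directory and everything in it are removed when the block exits.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "filter.xml")
        with open(path, "wb") as out:
            out.write(payload)
        # The parser must finish reading before the file disappears.
        return parser(path)


seen_paths = []


def demo_parser(path):
    # Record the path so we can check afterwards that the file is gone.
    seen_paths.append(path)
    return ET.parse(path)


# Stand-in payload so the sketch needs neither astropy nor a network.
sample = b"<VOTABLE><RESOURCE/></VOTABLE>"
tree = parse_via_tempfile(sample, demo_parser)
print(tree.getroot().tag)               # VOTABLE
print(os.path.exists(seen_paths[0]))    # False
```

In the question's setting, the payload would be `urlopen(fltr).read()` and the parser would be astropy's `parse`. This only works if the parser reads the whole file before returning, which is an assumption worth checking for your use case.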