Search code examples
pythondatatable

Reading an online .tbl data file in python


Like the Title says, I am trying to read an online data file that is in .tbl format. Here is the link to the data: https://irsa.ipac.caltech.edu/data/COSMOS/tables/morphology/cosmos_morph_cassata_1.1.tbl

I tried the following code

cosmos= pd.read_table('https://irsa.ipac.caltech.edu/data/COSMOS/tables/morphology/cosmos_morph_cassata_1.1.tbl')

Running this didn't give me any errors however when I wrote print (cosmos.column), it didn't give me a list of individuals columns instead python put everything together and gave me the output that looks like:

Index(['|            ID|            RA|           DEC|  MAG_AUTO_ACS|       R_PETRO|        R_HALF|    CONC_PETRO|     ASYMMETRY|          GINI|           M20|   Axial Ratio|     AUTOCLASS|   CLASSWEIGHT|'], dtype='object').

My main goal is to print the columns of that table individually and then print cosmos['RA']. Anyone know how this can be done?


Solution

  • Your file has four header rows and different delimiters in header (|) and data (whitespace). You can read the data by using skiprows argument of read_table.

    import requests
    import pandas as pd
    
    filename = 'cosmos_morph_cassata_1.1.tbl'
    url = 'https://irsa.ipac.caltech.edu/data/COSMOS/tables/morphology/' + filename
    n_header = 4
    
    ## Download large file to disc, so we can reuse it...
    table_file = requests.get(url)
    open(filename, 'wb').write(table_file.content)
    
    
    ## Skip the first 4 header rows and use whitespace as delimiter
    cosmos = pd.read_table(filename, skiprows=n_header, header=None, delim_whitespace=True)
    
    ## create header from first line of file
    with open(filename) as f:
        header_line = f.readline()
        ## trim whitespaces and split by '|'
        header_columns = header_line.replace(' ', '').split('|')[1:-1]
    
    cosmos.columns = header_columns
    

    enter image description here