curl netcdf apache-tika fits

Text extraction for FITS similar to NetCDF?


I'm working with NetCDF and FITS files. Tika extracts the header text from the NetCDF files, but for FITS files I only get basic file metadata. Does header text extraction not work on FITS files?

I followed https://wiki.apache.org/tika/TikaGDAL for FITS, but I only see the basic file metadata, not the actual text from the header.

This is what I'm using for NetCDF files (I also used tika --gui to see the header text):

    curl -X PUT --data-binary @age4_timeseries.nc http://localhost:9998/tika --header "Content-type: text/-t"
    curl -T age4_timeseries.nc http://localhost:9998/tika --header "Accept: text/plain"
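For anyone scripting this instead of using curl, the second request (a PUT of the raw file with "Accept: text/plain") could be sketched in Python roughly as below. This is only a sketch under assumptions: the function names are hypothetical, the URL is the one from the curl commands, and the generic "application/octet-stream" content type is my own choice so that the server does its own type detection.

```python
import urllib.request

# Assumption: a Tika server is listening at the URL used in the curl commands.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(payload: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Assumption: a generic type, leaving format detection to the server.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the server and return the response body as text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running, `extract_text("age4_timeseries.nc")` would return whatever text the server produces for that file.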

I've looked through the Tika Jira and found a reference from 2012: https://issues.apache.org/jira/browse/TIKA-874

But this does not appear to have been added to Tika.

I received this from Tika:

Content-Length: 40968000
Content-Type: application/fits
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.gdal.GDALParser
X-TIKA:digest:MD5: cce03f62a68c09ec562f9e8e05b54b40
X-TIKA:digest:SHA256: b3f0c61409cbd7f2c9aeb8bdfa0798d529383db699c1055b8a12a68267b948dd
resourceName: mirc0000.fits

But I was hoping to receive the content of the header, like this:

SIMPLE  =                    T / file does conform to FITS standard 
BITPIX  =                   16 / number of bits per data pixel                  
NAXIS   =                    3 / number of data axes                            
NAXIS1  =                 1280 / length of data axis 1                          
NAXIS2  =                   16 / length of data axis 2                          
NAXIS3  =                 1000 / length of data axis 3                          
EXTEND  =                    T / FITS dataset may contain extensions            
COMMENT   FITS (Flexible Image Transport System) format is defined in 'Astronomy
COMMENT   and Astrophysics', volume 376, page 359; bibcode: 2001A&A...376..359H 
BZERO   =                32768 / offset data range to that of unsigned short    
BSCALE  =                    1 / default scaling factor                         
DATE    = '2006-09-01T04:01:02' / File creation date (YYYY-MM-DDThh:mm:ss UTC)  
TELESCOP= 'CHARA array 330m max baseline, 6dishes' / Telescope                  
INSTURME= 'MIRC spectro/combiner' / The data acquisition instrument             
ORIGIN  = 'Mount Wilson Institute' / Origin of the Observation                  
SITELAT = '34.13   '           / Latitude (Geodetic, VLBI, to be verified)      
SITELONG= '118.03  '           / Longitude (Geodetic, VLBI, to be verified)     
SITEELEV= '1742.00 '           / Altitude above MSL, to be verified             
HISTORY = 'Multi-Dish FITS data' / File modification history                    
OBJECT  = 'HD_174639'          / Target name                                    
DATE-OBS= '09/01/2006'         / UT date (YYYY-MM-DD)                           
UTC-OBS = '04:00:10'           / Universal Time hh:mm:ss                        
LST-OBS = '18:48:41'           / Local Sidereal Time hh:mm:ss                   
CHARA-TM= '04:00:11'           / CHARA time  hh:mm:ss                           
LOST-TKS= '       0'           / CHARA lost Ticks in RT Clock t                 
LOST-SEC= '       0'           / CHARA lost seconds in rt clock s               
S1-TARGE=         41.342992001 / Delay line S1 target metrology                 
S2-TARGE=         38.610911409 / Delay line S2 target metrology                 
E1-TARGE=                   0. / Delay line E1 target metrology                 
E2-TARGE=                  44. / Delay line E2 target metrology                 
W1-TARGE=                   0. / Delay line W1 target metrology                 
W2-TARGE=                   0. / Delay line W2 target metrology                 
WAVELEN =                 1.65 / Central wavelength                             
BANDWID =                  0.3 / Bandwidth of spectrum                          
EXPOSURE=             5.483692 / Effective integration time in ms               
ROWOFFS =                    5 / Sub-image Y offset prom pixel 0                
COLOFFS =                   38 / Sub-image X offset prom pixel 0                
NREADS  =                    8 / Number of multiple reads for pixel             
FRMPRST =                 1000 / Number of frames per reset                     
VOFFSET =                   4. / PICNIC offset voltage                          
VD      =                   5. / PICNIC drain bias                              
ICTL    =                  3.3 / PICNIC warm OA offset voltage                  
END             
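For context on what this header text is: a FITS header is a run of fixed-width 80-character ASCII cards, stored in 2880-byte blocks and terminated by an END card. The sketch below reads just the primary header straight from the file, independently of Tika. It assumes a well-formed file and ignores any extension headers; the function name is hypothetical.

```python
CARD_SIZE = 80     # each header card is exactly 80 ASCII characters
BLOCK_SIZE = 2880  # headers are padded to a multiple of 2880 bytes (36 cards)


def read_primary_header(path: str) -> list:
    """Return the cards of the primary header, up to and including END."""
    cards = []
    with open(path, "rb") as fh:
        while True:
            block = fh.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE:
                raise ValueError("file ended before an END card was found")
            for offset in range(0, BLOCK_SIZE, CARD_SIZE):
                raw = block[offset:offset + CARD_SIZE]
                # Trailing padding spaces are dropped for readability.
                card = raw.decode("ascii", errors="replace").rstrip()
                cards.append(card)
                if card == "END":
                    return cards
```

Calling `print("\n".join(read_primary_header("mirc0000.fits")))` should print one card per line, in the same layout as the listing above.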

Solution

  • Got it working! The key point: the CFITSIO library must be installed before you build GDAL, because GDAL only builds its FITS driver when CFITSIO is available. CFITSIO library info: https://heasarc.gsfc.nasa.gov/docs/software/fitsio/fitsio.html

    Download GDAL from here: http://download.osgeo.org/gdal/CURRENT/

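    Unpack the downloaded archive with gunzip and then tar xvf, change into the extracted source directory, and build GDAL with CFITSIO support: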

    ./configure --with-cfitsio

    make

    make install

    Run Tika as usual. Now it works like a champ!
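Before re-running Tika, it may be worth confirming that the rebuilt GDAL actually includes the FITS driver. The sketch below runs `gdalinfo --formats` and looks for a given driver short name. It rests on two assumptions: that `gdalinfo` is on the PATH, and that each driver line in its output starts with the short name (for example `FITS`). The function names are hypothetical.

```python
import subprocess


def driver_names(formats_output: str) -> set:
    """Collect the first token of every non-empty line of the listing."""
    names = set()
    for line in formats_output.splitlines():
        tokens = line.split()
        if tokens:
            names.add(tokens[0])
    return names


def gdal_has_driver(name: str = "FITS") -> bool:
    """Run `gdalinfo --formats` and report whether `name` is listed."""
    try:
        result = subprocess.run(
            ["gdalinfo", "--formats"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # gdalinfo is missing or failed, so treat the driver as unavailable.
        return False
    return name in driver_names(result.stdout)
```

If `gdal_has_driver("FITS")` returns False after the build, the likely cause is that configure did not find CFITSIO.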