Tags: type-conversion, tiff, gdal, osgeo, xyz

Conversion problem: Large TIFF File to XYZ Format


I am working with a TIFF file named CHELSAcruts_tmax_4_1981_V.1.0.tif (size: 97M). I want to convert this file first to XYZ format and subsequently to CSV.

Attempts and Challenges

Online Converter Limitation: Initially, I attempted to use an online converter, but the file size exceeded the allowed limit.

Python Script for XYZ Conversion: I found the following Python code snippet for conversion to XYZ format:

from osgeo import gdal
ds = gdal.Open("CHELSAcruts_tmax_4_1981_V.1.0.tif")
ds.GetGeoTransform()
xyz = gdal.Translate("dem.xyz", ds)

However, this approach has two major issues:

1. It takes an excessively long time to run.
2. It generates a file larger than my VM's free disk space (over 100 GB).

Observation of Nodata Values: Upon closer inspection, I noticed the output contains many nodata values (-32768), such as:

X Y Z
-179.99597222220001 83.9956937485000168 -32768
-179.987638888900022 83.9956937485000168 -32768

Documentation: https://gdal.org/api/python/osgeo.gdal.html

Attempt to Filter Nodata Values: To address this, I tried filtering these values with the following code, but it was unsuccessful:

band = ds.GetRasterBand(1)
band.SetNoDataValue(-32768)
gdal.Translate("dem.xyz", ds, noData=-32768, creationOptions=["ADD_HEADER_LINE=YES"])

Using the code below, I found the image array's shape is (20880, 43200), which might be contributing to the file size issue.

from osgeo import gdal
ds = gdal.Open('mytif.tif', gdal.GA_ReadOnly)
rb = ds.GetRasterBand(1)
img_array = rb.ReadAsArray()
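Back-of-the-envelope, that shape alone explains the output size: a plain-text XYZ dump has one line per pixel, and (treating the ~47-byte sample lines above as typical, which is an assumption) the full dump runs to tens of gigabytes before any filtering:

```python
# Rough size estimate for a plain-text XYZ dump: one line per pixel.
rows, cols = 20880, 43200          # shape reported by ReadAsArray
pixels = rows * cols
bytes_per_line = 47                # length of one sample line above (assumed typical)
approx_gb = pixels * bytes_per_line / 1024**3
print(f"{pixels:,} lines, roughly {approx_gb:.0f} GB of text")
```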

I tried to clear the nodata values with the code below:

import gdal_calc
import numpy as np

original_tiff = "/content/drive/My Drive/Kalmia/CHELSAcruts_tmax_4_1981_V.1.0.tif"
modified_tiff = "/content/drive/My Drive/Kalmia/modified_tiff.tif"

# Use gdal_calc.py to replace -32768 with NaN
gdal_calc.Calc("A*(A!=-32768) + nan*(A==-32768)", A=original_tiff, outfile=modified_tiff, NoDataValue=np.nan)

However, it also ran forever and produced no result.

The file itself is available here: https://drive.google.com/file/d/1DPpxNFgcE3d4W-7AemBN_zTtylibI9Na/view?usp=sharing

Given the large size, how can I efficiently convert it to XYZ and then to CSV format, while managing the size and nodata issues?


Solution

  • Doing this doesn't make much sense to me, for several reasons. The XYZ/CSV format stores the coordinates (as strings!) for every valid pixel, which is completely redundant for a regular grid like this. The exact same information can be recovered from, for example, the geotransform (6 values) and the file dimensions (2 values).

    Storing it as uncompressed text completely blows up the volume of data. Using a different format like (Geo)Parquet would already alleviate a lot of that.

    Regardless, the snippet below shows how it can be done anyway. This does it all at once, so you'll need the memory to do so. It's of course possible to adapt the code and read the data in smaller chunks, and incrementally write them to the output file.

    Using pandas here, for convenience, probably also introduces its own overhead on top of this. For a simple file format like this, writing it manually is also feasible.
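For reference, the per-pixel coordinates are just an affine function of the pixel indices, which is why storing them per pixel is redundant. A pure-Python sketch of what gdal.ApplyGeoTransform computes (the gt values below are illustrative assumptions, not read from the actual file):

```python
# Pure-Python equivalent of gdal.ApplyGeoTransform: six numbers plus the
# raster shape encode every X/Y coordinate of the grid.
def apply_geotransform(gt, col, row):
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

# Assumed, illustrative geotransform (origin -180/84, 1/120-degree pixels);
# the real values come from ds.GetGeoTransform().
gt = (-180.0, 1 / 120, 0.0, 84.0, 0.0, -1 / 120)

# Pixel centers sit at half-pixel offsets from the grid origin:
x, y = apply_geotransform(gt, 0.5, 0.5)
print(x, y)  # center of the upper-left pixel
```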

    from osgeo import gdal
    import numpy as np
    import pandas as pd

    # Read raster data and metadata
    fn = "CHELSAcruts_tmax_4_1981_V.1.0.tif"

    ds = gdal.Open(fn)
    gt = ds.GetGeoTransform()

    data = ds.ReadAsArray()
    ys, xs = data.shape

    ds = None

    # Filter nodata and create pixel-center coordinates
    valid = data != -32768
    # valid.sum() vs valid.size shows how many pixels actually carry data

    ulx, uly = gdal.ApplyGeoTransform(gt, 0.5, 0.5)
    lrx, lry = gdal.ApplyGeoTransform(gt, xs - 0.5, ys - 0.5)

    # note: both stop values extend one step past the last pixel center,
    # so mgrid yields exactly (ys, xs) coordinates to match the data
    lats, lons = np.mgrid[uly:lry + gt[5]:gt[5], ulx:lrx + gt[1]:gt[1]]

    # reduce memory a little...
    lons = lons.astype(np.float32)
    lats = lats.astype(np.float32)

    # Write to CSV
    df = pd.DataFrame(dict(
        X=lons[valid],
        Y=lats[valid],
        Z=data[valid],
    ))

    df.to_csv("CHELSAcruts_tmax_4_1981_V.1.0.csv", index=False)

    As said, I would consider another approach where you at least store the data as binary, compressed, and without the redundant coordinates.
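As noted above, the all-at-once version can be adapted to stream the raster in row chunks so memory use stays bounded. A sketch of that idea (the function name and the nodata/chunk-size defaults are my own; the GDAL wiring is shown in the docstring):

```python
import numpy as np

def raster_to_csv_chunked(read_rows, gt, xs, ys, out_path,
                          nodata=-32768, chunk_rows=1024):
    """Stream a raster to CSV in row chunks, skipping nodata pixels.

    read_rows(row0, nrows) must return an (nrows, xs) array; with GDAL:
        read_rows = lambda r, n: ds.GetRasterBand(1).ReadAsArray(0, r, xs, n)
    """
    # Pixel-center X coordinates are the same for every row.
    lons = gt[0] + (np.arange(xs) + 0.5) * gt[1]
    with open(out_path, "w") as f:
        f.write("X,Y,Z\n")
        for row0 in range(0, ys, chunk_rows):
            nrows = min(chunk_rows, ys - row0)
            data = read_rows(row0, nrows)
            # Pixel-center Y coordinates for this chunk of rows.
            lats = gt[3] + (np.arange(row0, row0 + nrows) + 0.5) * gt[5]
            rr, cc = np.nonzero(data != nodata)
            out = np.column_stack([lons[cc], lats[rr], data[rr, cc]])
            np.savetxt(f, out, fmt="%.10g,%.10g,%g")
```

Swapping the np.savetxt call for a per-chunk DataFrame written through pyarrow's ParquetWriter would give the compressed binary alternative mentioned above.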