Search code examples
pythongisrasterrasterio

Get pixel coordinates in raster from polygon boundary in coordinate reference system


I have some GeoTiff files that are relatively large (10980 x 10980 pixels), that all correspond to the same geographic area (and have the same coordinate reference system), and I have a large number of polygons (100,000+) corresponding to land parcels, and I want to extract from each image file the pixels corresponding to each polygon. Currently, the way I'm doing this is using shapely Polygons and the rasterio.mask.mask function, like this:

for filename in image_files:
    with rasterio.open(filename) as src:
        for shape in shapes:
            data, _ = rasterio.mask.mask(src, [shape], crop=True)

This is empirically rather slow. If I have the mask indices precomputed, then I just need to read each image's entire data once and then use the pre-computed indices to pull out the relevant pixels for each polygon (I don't need them to be in the correct 2-dimensional configuration, I just need the values), and this is very fast. But I don't know if there's a fast way to get these pixel indices. I know that I could use rasterio's raster_geometry_mask function to get a mask the size of the whole image, and then use numpy to get the indices of the elements inside the polygon, but then it would be needlessly constructing a 10980 x 10980 array for each polygon to make the mask, and that's very very slow.


Solution

  • What I ended up doing is, when I open the first image, then for each polygon,

    • Use the image transform to convert the polygon to pixel coordinates, and find the rectangular bounding box containing the polygon in integer pixel coordinates.
    • To figure out which pixels in the bounding box are actually in the polygon, construct shapely Polygons for each pixel and use the .intersects() method (if you wanted to only include pixels that are completely inside the polygon, you could use .contains()). (I wasn't sure if this would be slow, but it turned out not to be.)
    • Save the list of coordinate pairs for all pixels in each polygon.

    Then for every new image you open, you just read the entire image data and index out the parts for each polygon because you already have the pixel indices.

    Code looks approximately like this:

    import math
    import numpy
    import pyproj
    import rasterio.mask
    from shapely.geometry import Polygon
    
    shape_pixels = None
    for filename in image_files:
        with rasterio.open(filename) as src:
            if shape_pixels is None:
                projector = pyproj.Proj(src.crs)
                pixelcoord_shapes = [
                    Polygon(zip(*(~src.transform * numpy.array(projector(*zip(*shape.boundary.coords))))))
                    for shape in shapes
                ]
    
                pixels_per_shape = []
                for shape in shapes:
                    xmin = max(0, math.floor(shape.bounds[0]))
                    ymin = max(0, math.floor(shape.bounds[1]))
                    xmax = math.ceil(shape.bounds[2])
                    ymax = math.ceil(shape.bounds[3])
                    pixel_squares = {}
                    for j in range(xmin, xmax+1):
                        for i in range(ymin, ymax+1):
                            pixel_squares[(i, j)] = Polygon.from_bounds(j, i, j+1, i+1)
                    pixels_per_shape.append([
                        coords for (coords, pixel) in pixel_squares.items()
                        if shape.intersects(pixel)
                    ])
    
            whole_data = src.read()
    
            for pixels in pixels_per_shape:
                ivals, jvals = zip(*pixels)
                shape_data = whole_data[0, ivals, jvals]
                ...