Search code examples
pythonholoviewsdatashadergeoviews

Geoviews + Datashader is slow when projecting points


I'm plotting 550,000,000 latitudes and longitudes using datashader. But, for this to be useful, I need to overlay map tiles and polygons using geoviews. The problem is that geoviews.points() and associated projection results in a large slowdown that makes the interactive nature of holoview + bokeh plots redundant.

There's a reproducible example below, but, in short - I'm trying to make the geoviews implementation (3) fast enough to work interactively.

First setup some data

import numpy as np
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.transfer_functions as tf
import holoviews as hv 
from holoviews.operation.datashader import datashade
import geopandas as gpd
import geoviews as gv

Scaling the size of the data down by 10 for example.

uk_bounding_box = (-14.02,2.09,49.67,61.06)
n = int(550000000 / 10)

# Generate some fake data of the same size
df = dd.from_pandas(
    pd.DataFrame.from_dict({
        'longitude': np.random.normal(
            np.mean(uk_bounding_box[0:2]),
            np.diff(uk_bounding_box[0:2]) / 5, n
        ),
        'latitude': np.random.normal(
            np.mean(uk_bounding_box[2:4]),
            np.diff(uk_bounding_box[2:4]) / 5, n
        )
    }), npartitions=8
)

# Persist data in memory so reading wont slow down datashader
df = df.persist()

(1) Just datashader

Just using datashader without holoviews or geo is very quick - output is rendered in 4 seconds, including aggregation, so re-renders would be event faster if interactive.

# Set some plotting params
bounds = dict(x_range = uk_bounding_box[0:2],
              y_range = uk_bounding_box[2:4])
plot_width = 400
plot_height = 300 

The time for pure datashader version:

%%time
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds)
agg = cvs.points(df, 'longitude', 'latitude', ds.count())

CPU times: user 968 ms, sys: 29.9 ms, total: 998 ms Wall time: 506 ms

tf.shade(agg)

just data shader

(2) datashader in holoviews without geoviews projection

# Set some params
sizes = dict(width=plot_width, height=plot_height)
opts = dict(bgcolor="black", **sizes)

hv.extension('bokeh')

hv.util.opts('Image Curve RGB Polygons [width=400 height=300 shared_axes=False] {+axiswise} ')

Without any projection this is comparable to using pure datashader

%%time
points = hv.Points(df, ['longitude', 'latitude']).redim.range(
    x=bounds['x_range'], y=bounds['y_range'])

shader = datashade(points, precompute=True ,**sizes).options(**opts)

CPU times: user 3.32 ms, sys: 131 µs, total: 3.45 ms Wall time: 3.47 ms

shader

holoviews render

(3) datashader in holoviews with geoviews tiles, polygons, and projection

Here's the crux of the issue - I want to align, the datashader layer with some map tiles and geospatial polygons. This results in a large slowdown that, for the size of data I'm dealing with, makes the interactive visualisation redundant. (12 minutes wait time in total for the render).

I'm sure this is to do with the overhead associated with projecting the points - is there a way to avoid this or any other workaround like precomputing the projection?

# Grab an example shape file to work with
ne_path = gpd.datasets.get_path('naturalearth_lowres')
example_shapes_df = gpd.read_file(ne_path)
uk_shape = example_shapes_df[example_shapes_df.name.str.contains('United K')]


# Grab maptiles
map_tiles = gv.tile_sources.ESRI

# In actual workflow I need to add some polygons
polys = gv.Polygons(uk_shape)

This is as above with the addition of gv.points() and projection

%%time 
points = gv.Points(df, ['longitude', 'latitude']).redim.range(
    x=bounds['x_range'], y=bounds['y_range'])

projected = gv.operation.project_points(points)

shader = datashade(projected, precompute=True ,**sizes).options(**opts)

CPU times: user 11.8 s, sys: 3.16 s, total: 15 s Wall time: 12.5 s

shader * map_tiles * polys

geoviews render


Solution

  • As suggested by @philippjfr the solution is to project the coordinates into the appropriate coordinate system and render using methods 2 or 3 above.

    Something like this:

    import cartopy
    
    def platcaree_to_mercator_vectorised(x, y):
        '''Use cartopy to convert Platecarree coords to Mercator.'''
        return(cartopy.crs.GOOGLE_MERCATOR.transform_points(
            cartopy.crs.PlateCarree(), x, y))
    
    def platcaree_for_map_partitions(pddf):
        '''Wrapper to apply mercator conversion and convert back to dataframe for Dask.'''
        as_arrays = platcaree_to_mercator_vectorised(pddf.longitude.values,pddf.latitude.values)
        as_df = pd.DataFrame.from_records(as_arrays[:, :2], columns=['longitude', 'latitude'])
        return(as_df)
    
    
    # Project the points
    df_projected = df.map_partitions(platcaree_for_map_partitions,
                                     meta={'longitude': 'f8', 'latitude': 'f8'})
    from dask.diagnostics import ProgressBar
    with ProgressBar():
        df_projected.to_parquet('abb_projected.parquet', compression='SNAPPY')
    

    Then use this projected dataset with method 2 or 3, detailed in question.