python, data-visualization, data-science, altair, vega-lite

Using Altair on data aggregated from large datasets


I am trying to plot a histogram of record counts for a large temporal dataset (300,000 records). For now I am only binning by month, which yields just 6 data points, but doing this with either json or altair_data_server storage makes the page crash. Is this impossible to handle well with pure Altair? I could of course preprocess in pandas, but that ruins the wonderful declarative nature of Altair.

If so, is this a missing feature of Altair, or is it out of scope? I'm learning that Vega-Lite stores the entire underlying dataset and applies the transformations at run time, but it seems like Altair could (and maybe does) have a way to store only the data relevant to the chart.

import altair as alt

# df: the ~300,000-row dataset with a 'timestamp' column
alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)

Solution

  • Altair charts work by sending the entire dataset to your browser and processing it in the frontend; for this reason, Altair does not work well with larger datasets, no matter how the dataset is served to the frontend.

    In cases like yours, where you are aggregating the data before displaying it, it would in theory be possible to do that aggregation in the backend, and only send aggregated data to the frontend renderer. There are some projects that hope to make this more seamless, including scalable Vega and altair-transform, but neither approach is very mature yet.

    In the meantime, I'd suggest doing your aggregations in pandas and sending only the aggregated data to Altair to plot.

    Edit 2023-01-25: VegaFusion addresses this problem by automatically pre-aggregating the data on the server and is mature enough for production use. Version 1.0 is available under the same license as Altair.
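
The pandas pre-aggregation suggested above could look roughly like the sketch below. It is not taken from the original answer: the synthetic `df`, the helper names `monthly_counts` and `monthly_count_chart`, and the output column names `month` and `count` are all assumptions made for illustration. Like the `month(timestamp)` encoding in the question, it groups by month of the year and ignores the year itself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): 300,000 random
# timestamps spread over January to June 2023 (181 days).
rng = np.random.default_rng(0)
start = pd.Timestamp("2023-01-01")
offsets = rng.integers(0, 181 * 24 * 3600, size=300_000)  # seconds from start
df = pd.DataFrame({"timestamp": start + pd.to_timedelta(offsets, unit="s")})


def monthly_counts(data: pd.DataFrame, column: str = "timestamp") -> pd.DataFrame:
    """Return one row per calendar month with the number of records in it."""
    return (
        data.groupby(data[column].dt.month)  # month number, 1..12
        .size()
        .rename_axis("month")
        .reset_index(name="count")
    )


def monthly_count_chart(monthly: pd.DataFrame):
    """Build a bar chart from the already-aggregated table."""
    # Imported here so the aggregation step can run without Altair installed.
    import altair as alt

    return alt.Chart(monthly).mark_bar().encode(x="month:O", y="count:Q")


# Only this small table (6 rows here) would be embedded in the chart spec,
# instead of all 300,000 records.
monthly = monthly_counts(df)
```

Calling `monthly_count_chart(monthly)` in a notebook should then render the same six bars, while the browser only receives the aggregated rows.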