Tags: python, altair, vega-lite

Altair LOESS fit below average values, far below linear regression


I'm relatively new to Altair and have run into an issue I can't make sense of. When I fit a LOESS curve to my data, the entire LOESS line is drawn below the sample average, below the averages at each time point, and below my regression fit.

The data is a panel of monthly arrest rates (part 2 crimes per 1,000 people) for a number of localities.

Here's a plot with the monthly average rates, a linear regression fit, and my LOESS fit. As you can see, the LOESS line is well below all of the plotted averages:

[Image: monthly average arrest rates with the linear regression and LOESS lines]

The code for this is:


import pandas as pd
import altair as alt

alt.data_transformers.disable_max_rows()

# Load panel data: monthly arrest rate (part 2 crimes per 1,000 people)
# for a number of localities.

panel = pd.read_csv(
    "https://github.com/nickeubank/im_baffled/raw/main/arrest_rates.csv.zip"
)

# And if I do averages for each month, I get
# a relatively smooth downward trend.

grouped_means = panel.groupby("years_w_decimals", as_index=False)[
    ["arrest_rate"]
].mean()

chart_grouped = (
    alt.Chart(grouped_means)
    .mark_circle(opacity=0.5)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
)

reg = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_regression(
        "years_w_decimals",
        "arrest_rate",
        method="poly",
        order=1,
    )
    .mark_line()
)

loess = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_loess(on="years_w_decimals", loess="arrest_rate", bandwidth=0.3)
    .mark_line()
)
reg + chart_grouped + loess

Any chance anyone can see what's going wrong?


Solution

  • OK, after much investigation, the issue is, as @joelostblom suggested, related to outliers.

    More specifically, it looks like Vega uses a less traditional LOESS implementation (without much documentation :/): https://github.com/vega/vega-lite/issues/7686
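To illustrate how outliers alone can produce this picture, here is a minimal sketch on synthetic data. It is not Vega's implementation: it assumes the smoother applies robustness reweighting (downweighting points with large residuals), which is an assumption on my part, and it simplifies LOESS to a local weighted mean over a fixed window. The data, the function names (`tricube`, `bisquare`, `smooth`) and the window width are all made up for illustration. With right-skewed rates, the reweighted fit sits below the plain local average, much like in the plot above.

```python
import random
import statistics

random.seed(0)

# Synthetic stand-in for the panel (NOT the real data): 24 months,
# 200 localities per month, right-skewed rates with a mild downward trend.
grid = [float(m) for m in range(24)]
xs, ys = [], []
for m in grid:
    for _ in range(200):
        xs.append(m)
        ys.append(random.lognormvariate(0.0, 1.0) * (10.0 - 0.1 * m))


def tricube(u):
    """Distance kernel: weight falls to zero at the edge of the window."""
    u = abs(u)
    return (1 - u**3) ** 3 if u < 1 else 0.0


def bisquare(u):
    """Robustness kernel: weight falls to zero for large scaled residuals."""
    u = abs(u)
    return (1 - u**2) ** 2 if u < 1 else 0.0


def smooth(xs, ys, grid, half_width, robust_iters):
    """Local weighted mean, optionally re-fit with outliers downweighted."""
    robust = [1.0] * len(xs)
    fitted = {}
    for it in range(robust_iters + 1):
        for g in grid:
            num = den = 0.0
            for x, y, r in zip(xs, ys, robust):
                w = tricube((x - g) / half_width) * r
                num += w * y
                den += w
            fitted[g] = num / den
        if it < robust_iters:
            # Rescale residuals by 6 * median absolute residual and
            # give points far from the current fit little or no weight.
            resid = [abs(y - fitted[x]) for x, y in zip(xs, ys)]
            scale = 6.0 * statistics.median(resid)
            robust = [bisquare(e / scale) for e in resid]
    return [fitted[g] for g in grid]


plain_fit = smooth(xs, ys, grid, half_width=4.0, robust_iters=0)
robust_fit = smooth(xs, ys, grid, half_width=4.0, robust_iters=2)

print("overall mean:  ", round(statistics.fmean(ys), 2))
print("overall median:", round(statistics.median(ys), 2))
for g, p, r in zip(grid[::6], plain_fit[::6], robust_fit[::6]):
    print(f"month {g:4.0f}: plain={p:.2f} robust={r:.2f}")
```

The plain fit tracks the local means, which the large values pull upward. The reweighted fit mostly ignores those values and ends up closer to the bulk of the data. If that is what is happening in the real panel, running `transform_loess` on `grouped_means` instead of the raw `panel` may give a curve that follows the monthly averages, though I have not checked that against this dataset.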