I'm relatively new to Altair and have run into an issue I can't explain. When I fit a LOESS curve to my data, the entire LOESS line is drawn below the sample average, below the averages at each time point, and below my regression fit.
The data is a panel of monthly arrest rates (part 2 crimes per 1,000 people) for a number of localities.
Here's a plot with monthly average rates, a linear regression fit, and my loess. As you can see, the loess is way below all the data:
The code for this is:
import pandas as pd
import altair as alt
alt.data_transformers.disable_max_rows()
# Load panel data: monthly arrest rates (part 2 crimes per 1,000 people)
# for a number of localities.
panel = pd.read_csv(
    "https://github.com/nickeubank/im_baffled/raw/main/arrest_rates.csv.zip"
)
# And if I do averages for each month, I get
# a relatively smooth downward trend.
grouped_means = panel.groupby("years_w_decimals", as_index=False)[
    ["arrest_rate"]
].mean()
chart_grouped = (
    alt.Chart(grouped_means)
    .mark_circle(opacity=0.5)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
)
reg = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_regression(
        "years_w_decimals",
        "arrest_rate",
        method="poly",
        order=1,
    )
    .mark_line()
)
loess = (
    alt.Chart(panel)
    .encode(
        x=alt.X("years_w_decimals", scale=alt.Scale(zero=False)),
        y=alt.Y("arrest_rate", scale=alt.Scale(zero=False)),
    )
    .transform_loess(on="years_w_decimals", loess="arrest_rate", bandwidth=0.3)
    .mark_line()
)
reg + chart_grouped + loess
Any chance anyone can see what's going wrong?
OK, after much investigation: the issue is, as @joelostblom suggested, related to outliers.
More specifically, it looks like Vega uses a less traditional LOESS implementation (without much documentation :/): https://github.com/vega/vega-lite/issues/7686
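For anyone who wants to see the outlier effect in isolation, here is a minimal sketch using only the standard library. It does not reproduce Vega's LOESS. It only assumes that an outlier-resistant estimate behaves roughly like a median, which is my reading of the situation rather than something the linked issue confirms. The data, the function name `summarize_periods`, and all the numbers are made up for illustration.

```python
import random
import statistics

random.seed(0)


def summarize_periods(n_periods=5, n_localities=200):
    """Return (period, mean, median) for a hypothetical right-skewed panel."""
    summary = []
    for period in range(n_periods):
        rates = []
        for i in range(n_localities):
            if i % 20 == 0:
                # Made-up extreme localities (5% of the sample).
                rates.append(random.uniform(100, 200))
            else:
                # Made-up typical localities.
                rates.append(random.uniform(1, 5))
        summary.append(
            (period, statistics.mean(rates), statistics.median(rates))
        )
    return summary


summary = summarize_periods()
for period, mean_rate, median_rate in summary:
    # The mean is pulled up by the few extreme values; the median is not.
    print(f"period={period}: mean={mean_rate:.2f}  median={median_rate:.2f}")
```

In every period the median sits well below the mean, because a handful of extreme localities dominate the average. If the smoother discounts those points while the plotted averages and the regression do not, a curve that runs below everything else is what you would expect. A quick check on the real data would be to compare the mean and median of `arrest_rate` per month.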