python, altair

Using color on bar chart with Altair seems to prevent zero=False on scale from having anticipated effect


The first chart from the code below (based on this: https://altair-viz.github.io/gallery/us_population_over_time_facet.html) lets the Y-axis start above zero, as expected. But the second chart, which adds a color to the encoding, no longer seems to respect the zero=False in alt.Scale.

Edit: I forgot to mention that I am using Altair 4.1.0.

import altair as alt
from vega_datasets import data
import pandas as pd

source = data.population.url

df = pd.read_json(source)
df = df[df["age"] <= 40]

alt.Chart(df).mark_bar().encode(
    x="age:O",
    y=alt.Y(
        "sum(people):Q",
        title="Population",
        axis=alt.Axis(format="~s"),
        scale=alt.Scale(zero=False),
    ),
    facet=alt.Facet("year:O", columns=5),
).resolve_scale(y="independent").properties(
    title="US Age Distribution By Year", width=90, height=80
)

alt.Chart(df).mark_bar().encode(
    x="age:O",
    y=alt.Y(
        "sum(people):Q",
        title="Population",
        axis=alt.Axis(format="~s"),
        scale=alt.Scale(zero=False),
    ),
    facet=alt.Facet("year:O", columns=5),
    color=alt.Color("year"),
).resolve_scale(y="independent").properties(
    title="US Age Distribution By Year", width=90, height=80
)


Solution

  • This happens because the scale is automatically adjusted to show every group of the variable you are coloring by. It is easier to see if we look at a single bar chart with stacked colors:

    import altair as alt
    from vega_datasets import data
    import pandas as pd
    
    source = data.population.url
    
    df = pd.read_json(source)
    df = df[df["age"] <= 40]
    
    alt.Chart(df.query('year < 1880')).mark_bar().encode(
        x="age:O",
        y=alt.Y(
            "sum(people):Q",
            axis=alt.Axis(format="~s"),
            scale=alt.Scale(zero=False)),
        color=alt.Color("year"))
    

    You are calculating the sum, which means that all the years end up in the same bar, stacked on top of each other. Altair / Vega-Lite expands the axis so that it includes all the groups in your colored variable.
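The arithmetic behind this can be sketched in plain Python. This is only an illustration, not Vega-Lite's actual code: it assumes the axis range has to cover both the start and the end of every stacked segment, and the population numbers and the helper names `stacked_extent` and `unstacked_extent` are made up for the example.

```python
# Hypothetical population counts per age and year (made-up numbers).
people = {
    0: {1850: 1_400_000, 1860: 2_100_000, 1870: 2_800_000},
    5: {1850: 1_300_000, 1860: 1_900_000, 1870: 2_500_000},
}


def stacked_extent(groups):
    """Return (min, max) over the start and end of every stacked segment."""
    lo, hi = float("inf"), float("-inf")
    for by_year in groups.values():
        start = 0  # the first segment of each bar starts at the baseline
        for year in sorted(by_year):
            end = start + by_year[year]
            lo, hi = min(lo, start, end), max(hi, start, end)
            start = end  # the next segment sits on top of this one
    return lo, hi


def unstacked_extent(groups):
    """Return (min, max) over the per-bar totals only."""
    totals = [sum(by_year.values()) for by_year in groups.values()]
    return min(totals), max(totals)


print(stacked_extent(people))    # extent when every colored segment must be visible
print(unstacked_extent(people))  # extent when only the bar totals matter
```

Under that assumption the stacked extent always reaches down to the baseline, because the bottom segment of every bar starts there, while the extent of the totals alone can sit well above zero.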

    If you instead color by age, the axis again expands to include all the colored groups, but because these groups no longer sit at the bottom of each bar, the axis is cut above zero.

    import altair as alt
    from vega_datasets import data
    import pandas as pd
    
    source = data.population.url
    
    df = pd.read_json(source)
    df = df[df["age"] <= 40]
    
    alt.Chart(df.query('year < 1880')).mark_bar().encode(
        x="age:O",
        y=alt.Y(
            "sum(people):Q",
            axis=alt.Axis(format="~s"),
            scale=alt.Scale(zero=False)),
        color=alt.Color("age"))
    

    The one thing this does not explain is why the first plot does not show only the tip of the darkest color and cut the axis around 2M. I am not sure about that off the top of my head.