Why does the #[inline] attribute stop working when a function is moved to a method on a struct?

I have a function, get_screen, defined in a separate module from main.rs. It takes two 2D vectors (a 1920x1080 one called screen and an even larger one called world) and copies the visible portion of world into screen. This was the function signature when I first wrote it:

pub fn get_screen(
    screen: &mut Vec<Vec<[u8; 4]>>,
    world: &Vec<Vec<Chunk>>,
    camera_coords: (isize, isize),
    screen_width: usize,
    screen_height: usize,
    chunk_width: usize,
    chunk_height: usize,
)

I had serious issues with execution time, but I optimized it from 14ms down to 3ms by using #[inline].

I then moved the world vector into its own struct (alongside related values such as the chunk width/height) and turned get_screen into a method on the new World struct. This is what the signature looked like after that change:

pub fn get_screen(
    &self,
    screen: &mut Vec<Vec<[u8; 4]>>,
    camera_coords: (isize, isize),
    screen_width: usize,
    screen_height: usize,
)

After that change, the execution time went back up to 14ms. I've tried enabling lto = true in Cargo.toml and switching to #[inline(always)] to force inlining, but the compiler no longer seems to optimize this function the way it used to.

I tried taking get_screen out of the struct and calling it as a free function like before, and that seems to fix it, but only if I don't pass anything from the struct. If I pass even a single usize from the World struct to the free get_screen function, the execution time goes from 3ms back up to 14ms.

To show what I mean: if I pass nothing directly from the world struct, and instead pass a clone of the 2D vector in world plus the hardcoded CHUNK_WIDTH/CHUNK_HEIGHT constants:

gen::get_screen(
    &mut screen.buf,
    &cloned_world_data,
    camera_coords,
    SCREEN_WIDTH,
    SCREEN_HEIGHT,
    CHUNK_WIDTH,
    CHUNK_HEIGHT,
);

It runs in 3.3ms. When I pass the usize fields chunk_width/chunk_height directly from the world struct:

gen::get_screen(
    &mut screen.buf,
    &cloned_world_data,
    camera_coords,
    SCREEN_WIDTH,
    SCREEN_HEIGHT,
    world.chunk_width,
    world.chunk_height,
);

it takes 14.55ms to run.

What's going on here? How can I get my get_screen function inlined while using my World struct? Ideally I'd like to move it back into World as a method instead of keeping it separate.

Here is a minimal example:

use std::time::Instant;

const SCREEN_HEIGHT: usize = 1080; //528;
const SCREEN_WIDTH: usize = 1920; //960;
const CHUNK_WIDTH: usize = 256;
const CHUNK_HEIGHT: usize = 256;

const GEN_RANGE: isize = 25; //how far out to gen chunks

fn main() {
    let batch_size = 1_000;
    struct_test(batch_size);
    separate_test(batch_size);
}

fn struct_test(batch_size: u32) {
    let world = World::new(CHUNK_WIDTH, CHUNK_HEIGHT, GEN_RANGE); //generate world
    let mut screen = vec![vec![[0; 4]; SCREEN_WIDTH]; SCREEN_HEIGHT];
    let camera_coords: (isize, isize) = (0, 0); //set camera location

    let start = Instant::now();
    for _ in 0..batch_size {
        get_screen(
            &mut screen,
            &world.data,
            camera_coords,
            SCREEN_WIDTH,
            SCREEN_HEIGHT,
            world.chunk_width,
            world.chunk_height,
        ); //gets visible pixels from world as 2d vec
    }
    println!(
        "struct:   {:?} {:?}",
        start.elapsed(),
        start.elapsed() / batch_size
    );
}

fn separate_test(batch_size: u32) {
    let world = World::new(CHUNK_WIDTH, CHUNK_HEIGHT, GEN_RANGE); //generate world
    let cloned_world_data = world.data.clone();
    let mut screen = vec![vec![[0; 4]; SCREEN_WIDTH]; SCREEN_HEIGHT];
    let camera_coords: (isize, isize) = (0, 0); //set camera location

    let start = Instant::now();
    for _ in 0..batch_size {
        get_screen(
            &mut screen,
            &cloned_world_data,
            camera_coords,
            SCREEN_WIDTH,
            SCREEN_HEIGHT,
            CHUNK_WIDTH,
            CHUNK_HEIGHT,
        ); //gets visible pixels from world as 2d vec
    }
    println!(
        "separate: {:?} {:?}",
        start.elapsed(),
        start.elapsed() / batch_size
    );
}

///gets all visible pixels on screen relative to camera position in world
#[inline(always)] //INLINE STOPPED WORKING??
pub fn get_screen(
    screen: &mut Vec<Vec<[u8; 4]>>,
    world: &Vec<Vec<Chunk>>,
    camera_coords: (isize, isize),
    screen_width: usize,
    screen_height: usize,
    chunk_width: usize,
    chunk_height: usize,
) {
    let camera = get_local_coords(&world, camera_coords, chunk_width, chunk_height); //gets loaded coords of camera in loaded chunks
    (camera.1 - screen_height as isize / 2..camera.1 + screen_height as isize / 2)
        .enumerate()
        .for_each(|(py, y)| {
            //for screen pixel index and particle in range of camera loaded y
            let (cy, ly) = get_local_pair(y, chunk_height); //calculate chunk y and inner y from loaded y
            if let Some(c_row) = world.get(cy) {
                //if chunk row at loaded chunk y exists
                (camera.0 - screen_width as isize / 2..camera.0 + screen_width as isize / 2)
                    .enumerate()
                    .for_each(|(px, x)| {
                        //for screen pixel index and particle in range of camera loaded x
                        let (cx, lx) = get_local_pair(x, chunk_width); //get loaded chunk x and inner x from loaded x
                        if let Some(c) = c_row.get(cx) {
                            screen[py][px] = c.data[ly][lx];
                        }
                        //if chunk in row then copy color of target particle in chunk
                        else {
                            screen[py][px] = [0; 4]
                        } //if target chunk doesn't exist color black
                    })
            } else {
                screen[py].iter_mut().for_each(|px| *px = [0; 4])
            } //if target chunk row doesn't exist color row black
        });
}

///calculates local coordinates in world vec from your global position
///returns negative if above/left of rendered area
pub fn get_local_coords(
    world: &Vec<Vec<Chunk>>,
    coords: (isize, isize),
    chunk_width: usize,
    chunk_height: usize,
) -> (isize, isize) {
    let (wx, wy) = world[0][0].chunk_coords; //gets coords of first chunk in rendered vec
    let lx = coords.0 - (wx * chunk_width as isize); //calculates local x coord based off world coords of first chunk
    let ly = (wy * chunk_height as isize) - coords.1; //calculates local y coord based off world coords of first chunk
    (lx, ly)
}

pub fn get_local_pair(coord: isize, chunk: usize) -> (usize, usize) {
    (coord as usize / chunk, coord as usize % chunk)
}

///contains chunk data
#[derive(Clone)]
pub struct Chunk {
    //world chunk object
    pub chunk_coords: (isize, isize), //chunk coordinates
    pub data: Vec<Vec<[u8; 4]>>,      //chunk Particle data
}

impl Chunk {
    ///generates chunk
    fn new(chunk_coords: (isize, isize), chunk_width: usize, chunk_height: usize) -> Self {
        let data = vec![vec![[0; 4]; chunk_width]; chunk_height];
        Self { chunk_coords, data }
    }
}

pub struct World {
    pub data: Vec<Vec<Chunk>>,
    pub chunk_width: usize,
    pub chunk_height: usize,
}

impl World {
    pub fn new(chunk_width: usize, chunk_height: usize, gen_range: isize) -> Self {
        let mut data = Vec::new(); //creates empty vec to hold world
        for (yi, world_chunk_y) in (-gen_range..=gen_range).rev().enumerate() {
            //for y index, y in gen range counting down
            data.push(Vec::new()); //push new row
            for world_chunk_x in -gen_range..=gen_range {
                //for chunk in gen range of row
                data[yi].push(Chunk::new(
                    (world_chunk_x, world_chunk_y),
                    chunk_width,
                    chunk_height,
                )); //gen new chunk and put it there
            }
        }
        Self {
            data,
            chunk_width,
            chunk_height,
        }
    }
}

Solution

Probably, when you pass world.chunk_width and world.chunk_height as arguments, the compiler cannot treat them as constants, so it emits real division and modulus instructions (in get_local_pair, which runs once per pixel).

On the other hand, when you pass constants for these parameters, they can be propagated through the algorithm (constant folding), so the expensive operations (division, modulus) are either eliminated or, since 256 is a power of two, turned into bit shifts and masks.
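For a power-of-two divisor such as 256, that folding is equivalent to the hand-written helper below. This is an illustrative sketch, not the compiler's actual output: get_local_pair is copied from the question, get_local_pair_shift is a hypothetical name, and the shift/mask form is only correct when chunk is a power of two (the debug_assert! checks that in debug builds).

```rust
/// Helper from the question: a real division and modulus at runtime
/// unless `chunk` is known at compile time.
fn get_local_pair(coord: isize, chunk: usize) -> (usize, usize) {
    (coord as usize / chunk, coord as usize % chunk)
}

/// Hypothetical hand-optimized equivalent, valid only when `chunk` is a
/// power of two: division becomes a right shift, modulus becomes a mask.
fn get_local_pair_shift(coord: isize, chunk: usize) -> (usize, usize) {
    debug_assert!(chunk.is_power_of_two());
    let c = coord as usize;
    let shift = chunk.trailing_zeros(); // log2(chunk)
    (c >> shift, c & (chunk - 1))
}

fn main() {
    const CHUNK_WIDTH: usize = 256;
    // Both versions agree for every coordinate when the chunk size is 2^8.
    for coord in [0isize, 1, 255, 256, 6400, 6939] {
        assert_eq!(
            get_local_pair(coord, CHUNK_WIDTH),
            get_local_pair_shift(coord, CHUNK_WIDTH)
        );
    }
    println!("{:?}", get_local_pair_shift(6400, CHUNK_WIDTH)); // prints "(25, 0)"
}
```

This is the kind of rewrite that becomes possible only when the divisor is visible to the optimizer; with a runtime chunk size it has to keep the general div instruction.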

Pasting your code into godbolt (Compiler Explorer), making separate_test() and struct_test() public, and compiling with -C opt-level=3 confirms this: div instructions appear in the generated code for struct_test() but not for separate_test().
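Keeping get_screen as a method on World can still give the optimizer constant divisors if the chunk dimensions become compile-time parameters of the type. The sketch below is a suggestion beyond the original answer and has not been benchmarked against the timings above. It assumes chunk sizes are fixed at compile time and makes W and H const generic parameters of a hypothetical World<W, H>. It fills chunks with a placeholder [255; 4] color so the demo can tell copied pixels from black, and it reads the screen size from the buffer instead of taking SCREEN_WIDTH/SCREEN_HEIGHT.

```rust
/// Simplified chunk from the question: its world coordinates and pixel data.
#[derive(Clone)]
pub struct Chunk {
    pub chunk_coords: (isize, isize),
    pub data: Vec<Vec<[u8; 4]>>,
}

/// Hypothetical World whose chunk dimensions are const generic parameters,
/// so `W` and `H` are compile-time constants inside every method.
pub struct World<const W: usize, const H: usize> {
    pub data: Vec<Vec<Chunk>>,
}

impl<const W: usize, const H: usize> World<W, H> {
    /// Generates chunks in -gen_range..=gen_range, rows counting down in y
    /// as in the question; pixels get a placeholder color of [255; 4].
    pub fn new(gen_range: isize) -> Self {
        let data: Vec<Vec<Chunk>> = (-gen_range..=gen_range)
            .rev()
            .map(|cy| {
                (-gen_range..=gen_range)
                    .map(|cx| Chunk {
                        chunk_coords: (cx, cy),
                        data: vec![vec![[255; 4]; W]; H],
                    })
                    .collect()
            })
            .collect();
        Self { data }
    }

    /// Same mapping as the free get_screen, but `/ W`, `% W`, `/ H` and `% H`
    /// divide by constants, which the optimizer can turn into shifts/masks
    /// when W and H are powers of two.
    pub fn get_screen(&self, screen: &mut [Vec<[u8; 4]>], camera_coords: (isize, isize)) {
        let screen_height = screen.len() as isize;
        let screen_width = screen.first().map_or(0, |row| row.len()) as isize;
        // Camera position in local pixel coordinates (same math as get_local_coords).
        let (wx, wy) = self.data[0][0].chunk_coords;
        let cam_x = camera_coords.0 - wx * W as isize;
        let cam_y = wy * H as isize - camera_coords.1;

        for (py, y) in (cam_y - screen_height / 2..cam_y + screen_height / 2).enumerate() {
            let (cy, ly) = (y as usize / H, y as usize % H);
            let row = &mut screen[py];
            match self.data.get(cy) {
                Some(c_row) => {
                    for (px, x) in (cam_x - screen_width / 2..cam_x + screen_width / 2).enumerate() {
                        let (cx, lx) = (x as usize / W, x as usize % W);
                        // Copy the pixel if the chunk exists, otherwise color it black.
                        row[px] = c_row.get(cx).map_or([0; 4], |c| c.data[ly][lx]);
                    }
                }
                // Whole chunk row missing: color the screen row black.
                None => row.fill([0; 4]),
            }
        }
    }
}

fn main() {
    // Small sizes keep the demo fast; the question would use World<256, 256>.
    let world: World<16, 16> = World::new(2);
    let mut screen = vec![vec![[0u8; 4]; 40]; 30];

    // Camera at the origin: the whole screen lies inside generated chunks.
    world.get_screen(&mut screen, (0, 0));
    assert!(screen.iter().flatten().all(|p| *p == [255; 4]));

    // Camera far outside the world: every pixel falls back to black.
    world.get_screen(&mut screen, (10_000, 0));
    assert!(screen.iter().flatten().all(|p| *p == [0; 4]));
}
```

With the question's constants the type would be written World<CHUNK_WIDTH, CHUNK_HEIGHT>. The trade-off is that each chunk size produces its own type and its own monomorphized copy of get_screen, and the sizes can no longer be chosen at runtime.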