I have the function `get_screen`, which is defined in a separate module from `main.rs`. It takes two 2D vectors (one that's 1920x1080 called `screen`, and an even larger one called `world`) and maps a portion of the `world` vector onto the `screen` vector. This was the function signature when I first wrote it:
```rust
pub fn get_screen(
    screen: &mut Vec<Vec<[u8; 4]>>,
    world: &Vec<Vec<Chunk>>,
    camera_coords: (isize, isize),
    screen_width: usize,
    screen_height: usize,
    chunk_width: usize,
    chunk_height: usize,
)
```
I had serious issues with execution time, but I got it down from 14ms to 3ms by using `#[inline]`.
I then moved the `world` vector into its own struct (alongside some other related variables such as chunk width/height) and turned `get_screen` into a method on the new `World` struct. This is what the signature looked like after that change:
```rust
pub fn get_screen(
    &self,
    screen: &mut Vec<Vec<[u8; 4]>>,
    camera_coords: (isize, isize),
    screen_width: usize,
    screen_height: usize,
)
```
After that, the execution time went back up to 14ms. I've tried enabling `lto = true` in `Cargo.toml` and switching to `#[inline(always)]` to force inlining, but the compiler seems to refuse to optimize this function the way it used to.
I tried removing `get_screen` from the struct and calling it as a standalone function like before. That seems to fix it, but only if I don't pass anything from the struct. If I pass even a single `usize` from the `World` struct to the standalone `get_screen` function, the execution time goes from 3ms back up to 14ms.
To show what I mean: if I pass nothing directly from the `world` struct, and instead pass a cloned copy of the 2D vector inside `world` together with the hardcoded `CHUNK_WIDTH`/`CHUNK_HEIGHT` constants:
```rust
gen::get_screen(
    &mut screen.buf,
    &cloned_world_data,
    camera_coords,
    SCREEN_WIDTH,
    SCREEN_HEIGHT,
    CHUNK_WIDTH,
    CHUNK_HEIGHT,
);
```
it runs in 3.3ms. When I pass the `usize` fields `chunk_width`/`chunk_height` directly from the `world` struct:
```rust
gen::get_screen(
    &mut screen.buf,
    &cloned_world_data,
    camera_coords,
    SCREEN_WIDTH,
    SCREEN_HEIGHT,
    world.chunk_width,
    world.chunk_height,
);
```
it takes 14.55ms to run.

What's going on here? How can I get my `get_screen` function inlined while using my `World` struct? Preferably, I'd like to re-add it to my `World` struct as a method instead of keeping it separate.
Here is a minimal example:
```rust
use std::time::Instant;

const SCREEN_HEIGHT: usize = 1080; //528;
const SCREEN_WIDTH: usize = 1920; //960;
const CHUNK_WIDTH: usize = 256;
const CHUNK_HEIGHT: usize = 256;
const GEN_RANGE: isize = 25; //how far out to gen chunks

fn main() {
    let batch_size = 1_000;
    struct_test(batch_size);
    separate_test(batch_size);
}

fn struct_test(batch_size: u32) {
    let world = World::new(CHUNK_WIDTH, CHUNK_HEIGHT, GEN_RANGE); //generate world
    let mut screen = vec![vec![[0; 4]; SCREEN_WIDTH]; SCREEN_HEIGHT];
    let camera_coords: (isize, isize) = (0, 0); //set camera location
    let start = Instant::now();
    for _ in 0..batch_size {
        get_screen(
            &mut screen,
            &world.data,
            camera_coords,
            SCREEN_WIDTH,
            SCREEN_HEIGHT,
            world.chunk_width,
            world.chunk_height,
        ); //gets visible pixels from world as 2d vec
    }
    println!(
        "struct: {:?} {:?}",
        start.elapsed(),
        start.elapsed() / batch_size
    );
}

fn separate_test(batch_size: u32) {
    let world = World::new(CHUNK_WIDTH, CHUNK_HEIGHT, GEN_RANGE); //generate world
    let cloned_world_data = world.data.clone();
    let mut screen = vec![vec![[0; 4]; SCREEN_WIDTH]; SCREEN_HEIGHT];
    let camera_coords: (isize, isize) = (0, 0); //set camera location
    let start = Instant::now();
    for _ in 0..batch_size {
        get_screen(
            &mut screen,
            &cloned_world_data,
            camera_coords,
            SCREEN_WIDTH,
            SCREEN_HEIGHT,
            CHUNK_WIDTH,
            CHUNK_HEIGHT,
        ); //gets visible pixels from world as 2d vec
    }
    println!(
        "separate: {:?} {:?}",
        start.elapsed(),
        start.elapsed() / batch_size
    );
}

///gets all visible pixels on screen relative to the camera position in world
#[inline(always)] //INLINE STOPPED WORKING??
pub fn get_screen(
    screen: &mut Vec<Vec<[u8; 4]>>,
    world: &Vec<Vec<Chunk>>,
    camera_coords: (isize, isize),
    screen_width: usize,
    screen_height: usize,
    chunk_width: usize,
    chunk_height: usize,
) {
    let camera = get_local_coords(world, camera_coords, chunk_width, chunk_height); //gets loaded coords of camera in loaded chunks
    (camera.1 - screen_height as isize / 2..camera.1 + screen_height as isize / 2)
        .enumerate()
        .for_each(|(py, y)| {
            //for screen pixel index and particle in range of camera loaded y
            let (cy, ly) = get_local_pair(y, chunk_height); //calculate chunk y and inner y from loaded y
            if let Some(c_row) = world.get(cy) {
                //if chunk row at loaded chunk y exists
                (camera.0 - screen_width as isize / 2..camera.0 + screen_width as isize / 2)
                    .enumerate()
                    .for_each(|(px, x)| {
                        //for screen pixel index and particle in range of camera loaded x
                        let (cx, lx) = get_local_pair(x, chunk_width); //get loaded chunk x and inner x from loaded x
                        if let Some(c) = c_row.get(cx) {
                            //if chunk in row then copy color of target particle in chunk
                            screen[py][px] = c.data[ly][lx];
                        } else {
                            //if target chunk doesn't exist color black
                            screen[py][px] = [0; 4]
                        }
                    })
            } else {
                //if target chunk row doesn't exist color row black
                screen[py].iter_mut().for_each(|px| *px = [0; 4])
            }
        });
}

///calculates local coordinates in world vec from your global position
///returns negative if above/left of rendered area
pub fn get_local_coords(
    world: &Vec<Vec<Chunk>>,
    coords: (isize, isize),
    chunk_width: usize,
    chunk_height: usize,
) -> (isize, isize) {
    let (wx, wy) = world[0][0].chunk_coords; //gets coords of first chunk in rendered vec
    let lx = coords.0 - (wx * chunk_width as isize); //calculates local x coord based off world coords of first chunk
    let ly = (wy * chunk_height as isize) - coords.1; //calculates local y coord based off world coords of first chunk
    (lx, ly)
}

pub fn get_local_pair(coord: isize, chunk: usize) -> (usize, usize) {
    (coord as usize / chunk, coord as usize % chunk)
}

///contains chunk data
#[derive(Clone)]
pub struct Chunk {
    //world chunk object
    pub chunk_coords: (isize, isize), //chunk coordinates
    pub data: Vec<Vec<[u8; 4]>>,      //chunk Particle data
}

impl Chunk {
    ///generates chunk
    fn new(chunk_coords: (isize, isize), chunk_width: usize, chunk_height: usize) -> Self {
        let data = vec![vec![[0; 4]; chunk_width]; chunk_height];
        Self { chunk_coords, data }
    }
}

pub struct World {
    pub data: Vec<Vec<Chunk>>,
    pub chunk_width: usize,
    pub chunk_height: usize,
}

impl World {
    pub fn new(chunk_width: usize, chunk_height: usize, gen_range: isize) -> Self {
        let mut data = Vec::new(); //creates empty vec to hold world
        for (yi, world_chunk_y) in (gen_range * -1..gen_range + 1).rev().enumerate() {
            //for y index, y in gen range counting down
            data.push(Vec::new()); //push new row
            for world_chunk_x in gen_range * -1..gen_range + 1 {
                //for chunk in gen range of row
                data[yi].push(Chunk::new(
                    (world_chunk_x, world_chunk_y),
                    chunk_width,
                    chunk_height,
                )); //gen new chunk and put it there
            }
        }
        Self {
            data,
            chunk_width,
            chunk_height,
        }
    }
}
```
Probably, when you use `world.chunk_width` and `world.chunk_height` as parameters, the compiler does not treat them as constants and actually generates division and modulus operations.

On the other hand, when you provide constants for these parameters, they can be propagated through the algorithm (constant folding), and the expensive operations (division, modulus) are either not performed at all or transformed into cheaper ones. Since 256 is a power of two, an unsigned `x / 256` can become a right shift by 8 and `x % 256` can become `x & 255`.
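To make the difference concrete, here is a small sketch (not taken from the question) of the three forms involved: division by a compile-time constant, division by a run-time value, and a hand-written shift/mask version. `split_const`, `split_runtime` and `split_shift_mask` are hypothetical names; the shift/mask form assumes the chunk size is a power of two and is passed as its base-2 logarithm.

```rust
/// Chunk size known at compile time, like the question's CHUNK_WIDTH.
const CHUNK: usize = 256;

/// Divisor is a constant: the optimizer is free to replace `/` and `%`
/// with a shift and a mask, because 256 is a power of two.
pub fn split_const(coord: usize) -> (usize, usize) {
    (coord / CHUNK, coord % CHUNK)
}

/// Divisor is only known at run time (like `world.chunk_width`): unless the
/// value gets propagated in, a general division has to be emitted.
pub fn split_runtime(coord: usize, chunk: usize) -> (usize, usize) {
    (coord / chunk, coord % chunk)
}

/// Hypothetical manual strength reduction for a run-time power-of-two size.
/// `shift` is log2 of the chunk size (8 for 256).
pub fn split_shift_mask(coord: usize, shift: u32) -> (usize, usize) {
    (coord >> shift, coord & ((1usize << shift) - 1))
}

fn main() {
    let shift = CHUNK.trailing_zeros(); // 8 for 256
    for coord in [0, 1, 255, 256, 6400, 6939, 12_345] {
        let expected = split_const(coord);
        assert_eq!(expected, split_runtime(coord, CHUNK));
        assert_eq!(expected, split_shift_mask(coord, shift));
    }
    println!("{:?}", split_const(6400)); // (25, 0)
}
```

Pasting the three `pub fn`s into Compiler Explorer with `-C opt-level=3` is a quick way to compare them: the constant and shift/mask versions would be expected to contain no `div`, while the run-time version generally does.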
Copying and pasting your code into Godbolt (Compiler Explorer), making `separate_test()` and `struct_test()` public, and compiling with `-C opt-level=3` confirms this: `div` instructions are present in the generated code for `struct_test()` but not for `separate_test()`.