Tags: algorithm, sorting, caching, graph, rust

What algorithm can I use to choose the order in which I visit the edges of a graph, to minimize the number of cache misses?


Big Picture

I'm writing some code to simulate a computer at the transistor level. The emulator boils down to a graph where each node is a transistor, and each edge is a wire connecting any two transistor nodes on the graph. This graph is cyclic, and transistor nodes may be connected to themselves.

To run a single "step" of the emulator, two separate functions are run:

  1. Each wire edge is processed, setting the input of its target node from the output of its source node. Each wire is visited exactly once per step, but a transistor may be visited multiple times.
  2. Each transistor node's output state is updated from its input states (how is outside the scope of this question; I'm pretty sure I'm doing it efficiently).

I believe I have the second step optimised, but I need help making the first step more efficient.

Implementation

The code looks roughly like this:

type InputBit = usize;
type OutputBit = usize;

struct Emulator {
    inputs: Vec<u64>,
    outputs: Vec<u64>,
    wires: Vec<(InputBit, OutputBit)>,
}

impl Emulator {
    fn step(&mut self) {
        self.step_wires();
        self.step_transistors();
    }

    fn step_wires(&mut self) {
        for &(input, output) in self.wires.iter() {
            // NB omitted bit-twiddling to get bit indices
            self.inputs[input] = self.outputs[output];
        }
    }

    fn step_transistors(&mut self) {
        // ... omitted for brevity ...
    }
}

Each transistor node N is composed of two input bits at bit 2N and 2N+1 in self.inputs, and two output bits at 2N and 2N+1 in self.outputs.
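For reference, the omitted bit-twiddling presumably looks something like this (my own sketch, assuming the `Vec<u64>`s are packed bitsets; the helper names are mine):

```rust
/// Read bit B of a packed bitset: bit B lives in word B / 64 at position B % 64.
fn get_bit(words: &[u64], bit: usize) -> bool {
    (words[bit / 64] >> (bit % 64)) & 1 != 0
}

/// Write bit B of a packed bitset.
fn set_bit(words: &mut [u64], bit: usize, value: bool) {
    let (word, pos) = (bit / 64, bit % 64);
    if value {
        words[word] |= 1 << pos;
    } else {
        words[word] &= !(1 << pos);
    }
}
```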

The problem, as I see it, is that my list of wires (and of transistors) is in arbitrary order, which makes it really cache-inefficient. For example, imagine this set of wires (input node bit, output node bit):

[
    (0, 1000),
    (1000, 2000),
    (1, 1001),
    (1001, 2001),
]

If my memory cache is smaller than 1000 bits, I get a cache miss on most of the reads and writes. If the wires were reorganised into:

[
    (0, 1000),
    (1, 1001),
    (1000, 2000),
    (1001, 2001),
]

Then there are fewer cache misses. Equally, I could "move" the transistor nodes to give the following equivalent graph:

[
    (0, 2),
    (1, 3),
    (2, 4),
    (3, 5),
]

Which now uses only one cache line! Much faster. (Note this example is slightly misleading: the node indices will be densely packed, i.e. there won't be any "empty" node indices that are unused, but it is fine to "swap" nodes.)

The Question

What algorithm can I use to choose the order in which I visit the wire edges, and/or to reorder the transistor node indices, so that I incur the minimum number of cache misses while traversing the graph?

I think something that reduces the total "distance" spanned by all the edges would be a good start? And then a sort that first visits the edges lying entirely within a single cache line, in cache-line order, before handling the edges that cross cache lines in some order?
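For what it's worth, the second half of that idea (grouping wires by the cache lines they touch) is cheap to try on its own, without relabelling any nodes. A minimal sketch, assuming 64-byte (512-bit) cache lines; the constant and function name are mine:

```rust
/// Bits per cache line, assuming 64-byte lines.
const CACHE_LINE_BITS: usize = 64 * 8;

/// Reorder wires so that ones touching the same cache lines are
/// visited consecutively: sort by the line of the first endpoint,
/// breaking ties by the line of the second.
fn sort_wires_by_cache_line(wires: &mut [(usize, usize)]) {
    wires.sort_by_key(|&(input, output)| {
        (input / CACHE_LINE_BITS, output / CACHE_LINE_BITS)
    });
}
```

On the four-wire example above, this produces exactly the second ordering shown: the two wires starting near bit 0 come first, then the two starting near bit 1000.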


Solution

  • This is a hard problem to solve in general. The paper Vertex Reordering for Real-World Graphs and Applications: An Empirical Evaluation gives a good overview of various reordering algorithms and their effects on performance and locality. The paper in general weighs in favor of Rabbit-order based algorithms and variations thereof. Tuning for your specific case will depend on cache size, data size, branching factor, partitioning behavior, etc.

    However, if you're looking for something that works decently well and isn't too hard to implement, I'd recommend the Reverse Cuthill-McKee (RCM) algorithm.
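A rough sketch of what RCM could look like here (my own illustration, not tuned for the emulator above): build an undirected adjacency from the wires, breadth-first search from a low-degree seed visiting neighbours in increasing-degree order, then reverse the visit order.

```rust
use std::collections::VecDeque;

/// Reverse Cuthill-McKee ordering of an undirected graph with `n` nodes
/// and edge list `edges`. Returns `perm`, where `perm[old] = new`.
/// Handles disconnected components and ignores self-loops.
fn reverse_cuthill_mckee(n: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    // Build adjacency lists, treating wires as undirected for ordering.
    let mut adj = vec![Vec::new(); n];
    for &(a, b) in edges {
        adj[a].push(b);
        if a != b {
            adj[b].push(a);
        }
    }
    // Visit low-degree neighbours first, as classic Cuthill-McKee does.
    let degree: Vec<usize> = adj.iter().map(|v| v.len()).collect();
    for list in adj.iter_mut() {
        list.sort_by_key(|&v| degree[v]);
    }

    let mut order = Vec::with_capacity(n);
    let mut visited = vec![false; n];
    // Seed each component with its lowest-degree unvisited node.
    let mut seeds: Vec<usize> = (0..n).collect();
    seeds.sort_by_key(|&v| degree[v]);
    for seed in seeds {
        if visited[seed] {
            continue;
        }
        visited[seed] = true;
        let mut queue = VecDeque::from([seed]);
        while let Some(v) = queue.pop_front() {
            order.push(v);
            for &w in &adj[v] {
                if !visited[w] {
                    visited[w] = true;
                    queue.push_back(w);
                }
            }
        }
    }
    order.reverse(); // the "reverse" in RCM

    // Convert the visit order into a relabelling permutation.
    let mut perm = vec![0; n];
    for (new, &old) in order.iter().enumerate() {
        perm[old] = new;
    }
    perm
}
```

With the two-bits-per-node layout described in the question, applying the permutation would mean remapping bit index 2N+k to 2·perm[N]+k in the wire list and in both bit arrays. RCM minimises the graph's bandwidth (the largest label difference across an edge), which is exactly the "reduce total edge distance" intuition from the question.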